RESEARCH : BIG DATA ANALYSIS – DATASETS (Assignment 6)

 

BIG DATA ASSIGNMENT 6

by Wirawan Rizkika 1401140469

 

A data set (or dataset, although this spelling is not present in many contemporary dictionaries) is a collection of data.

Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows.

Analysis is the process of breaking a complex topic or substance into smaller parts in order to gain a better understanding of it.

Tools Used:

The tool used for this analysis is Gephi (v0.9.1, the latest version at the time of writing). I prefer Gephi because it is easy to use and very friendly for first-timers, and it visualizes the dataset properly. Compared to a more advanced visualizer such as Cytoscape, Gephi is a smaller, simpler, perhaps less “mature” project. Therefore, if you just want a quick and dirty network visualizer to get a feel for the basics, Gephi may be a good place to start, because it is less complicated and not as confusing the first time you use it.

 Data Source

This dataset contains 236 Nodes and 5899 Edges.

I got the dataset from http://www.sociopatterns.org. It describes the interactions between students and teachers at a school. The dataset comprises two weighted networks of face-to-face proximity between students and teachers. Nodes are individuals and edges represent face-to-face interactions.

Nodes have two attributes:

  • classname, which indicates the school class and grade of the corresponding individual; and
  • gender.

Teachers are all assigned to the “Teachers” class. Edges between A and B have two weights associated with them:

  • duration, which is the cumulative time spent by A and B in face-to-face proximity, over one day, measured in seconds (multiples of 20 seconds); and
  • count, which is the number of times the A-B contact was established during the school day.
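To make the two edge weights concrete, here is a small sketch of how such an edge list could be read with plain Python. The column names (source, target, duration, count) and the three sample rows are made up for illustration; they are not taken from the actual SocioPatterns file.

```python
import csv
import io

# Hypothetical sample rows; column names and values are made up for illustration.
SAMPLE_CSV = """source,target,duration,count
A,B,60,2
A,C,20,1
B,C,140,3
"""

def load_edges(text):
    """Parse an edge list into {(u, v): {"duration": seconds, "count": contacts}}."""
    edges = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Store each undirected pair under a sorted key so A-B and B-A coincide.
        key = tuple(sorted((row["source"], row["target"])))
        edges[key] = {
            "duration": int(row["duration"]),
            "count": int(row["count"]),
        }
    return edges

edges = load_edges(SAMPLE_CSV)
# Total face-to-face time across all pairs, in seconds.
total_seconds = sum(edge["duration"] for edge in edges.values())
```

Each pair of individuals ends up with one entry carrying both weights, so either duration or count could be used as the edge weight later on.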

 

Each node consists of:

  • Students/Teachers ID
  • Students/Teachers Label
  • Classname
  • Gender

 

Edges consist of:

The relations (paths) created between pairs of nodes. Since the data source is a .csv file, all of the nodes and edges are listed in Gephi's Data Laboratory, where the data can be managed easily.

*Dataset files used:

Screen Shot 2016-10-01 at 8.23.37 PM.png

Edges Table before Statistics Run:

Screen Shot 2016-10-01 at 8.33.19 PM.png

 

Nodes Table before statistics Run:

screen-shot-2016-10-01-at-8-33-11-pm

 

Edges Table after Statistics Run:

screen-shot-2016-10-01-at-8-36-50-pm

 

Nodes Table after Statistics Run:

screen-shot-2016-10-01-at-8-37-04-pm

 

 

  • Eigenvector centrality ranks nodes by how well connected they are to other well-connected nodes.
  • Modularity class shows the community each node belongs to; modularity itself measures “the strength of division of a network into modules (also called groups, clusters or communities)”.
  • The weighted degree of a node is like the degree, but instead of counting the edges of a node it sums the weights of those edges.
  • Betweenness centrality is a measure based on the number of shortest paths between any two nodes that pass through a particular node. Nodes around the edge of the network would typically have a low betweenness centrality. A high betweenness centrality might suggest that the individual is connecting various different parts of the network together.
  • Closeness centrality is a measure that indicates how close a node is to all the other nodes in a network, whether or not the node lies on a shortest path between other nodes. Under the usual definition it is the inverse of the average shortest-path distance to the other nodes, so a high closeness centrality means a short average distance to all other nodes, and a low closeness centrality means a large average distance.
  • The eccentricity measure captures the distance between a node and the node that is furthest from it; so a high eccentricity means that the furthest away node in the network is a long way away, and a low eccentricity means that the furthest away node is actually quite close.
  • PageRank is an iterative algorithm that measures the importance of each node within the network. The metric assigns each node a probability.

 

Dataset Visualization

screen-shot-2016-10-01-at-8-06-58-pm

*The whole dataset visualized

Software Analysis

  • Degree

screen-shot-2016-10-01-at-9-15-08-pm

The nodes with the most connections, visualized using the Fruchterman Reingold layout. The red nodes have more edges than the green-to-blue ones. As we can see here, the most connected node is ID 1551.

  • Degree

screen-shot-2016-10-01-at-9-20-11-pm

The picture above shows the same result, the most connected nodes, but here the nodes are grouped by class, which makes each node's classname easier to see.

  • Classname

screen-shot-2016-10-01-at-9-25-30-pm

screen-shot-2016-10-01-at-9-27-52-pm

The colour of this visualization is based on the classname. This graph is visualized by using Force-Atlas 2 layout, to disperse groups and give space around larger nodes.

  • Gender

screen-shot-2016-10-01-at-9-30-09-pm

screen-shot-2016-10-01-at-9-30-19-pm

This graph shows the edges and nodes coloured by the gender of each node.

  • Page Rank

screen-shot-2016-10-01-at-9-33-19-pm

This graph shows the PageRank. The highest-ranked nodes are dark blue. The nodes with the highest rank are the most important nodes within the network.
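To give a feel for what "iterative" means here, this is a minimal PageRank sketch in plain Python. It is not Gephi's implementation: the four-node graph is made up, the damping factor 0.85 is the commonly used default, and the code assumes every node has at least one neighbour.

```python
# Made-up undirected graph: a triangle A-B-C with D attached to C.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank; assumes no node is isolated."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {}
        for node in graph:
            # Each neighbour shares its current rank equally among its own edges.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

ranks = pagerank(graph)
top_node = max(ranks, key=ranks.get)
```

The ranks add up to 1, which is why the metric can be read as a probability for each node. In this toy graph the best connected node, C, gets the highest rank.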

  • Weighted Degree

screen-shot-2016-10-01-at-9-37-29-pm

This one shows which nodes have the highest weighted degree. Since all of the edges have the same weight (1.0), the results are the same as for the degree.
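A short sketch shows why the two measures coincide when every edge weight is 1.0. The edge list below is made up; only the idea (count edges versus sum their weights) comes from the definitions above.

```python
from collections import defaultdict

# Made-up undirected edges as (node_a, node_b, weight); every weight is 1.0.
edges = [("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 1.0), ("C", "D", 1.0)]

def degrees(edge_list):
    """Return (degree, weighted_degree) dictionaries for an undirected edge list."""
    degree = defaultdict(int)
    weighted = defaultdict(float)
    for u, v, weight in edge_list:
        for node in (u, v):
            degree[node] += 1         # count the edge
            weighted[node] += weight  # sum its weight
    return dict(degree), dict(weighted)

degree, weighted_degree = degrees(edges)

# With unequal weights the two measures no longer agree.
_, heavier = degrees([("A", "B", 3.0), ("A", "C", 1.0)])
```

If the duration or count column were used as the edge weight instead, the weighted degree would start to differ from the plain degree.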

  • Closeness Centrality

screen-shot-2016-10-01-at-9-41-01-pm

This graph shows the closeness centrality. The highest value is 0.63172, which belongs to node 1551. This means node 1551 is, on average, the closest to every other node. Node 1551 is more influential because many other nodes are connected to it directly, so 1551 can spread information to them without an intermediary.
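For readers who want to see where a number like this comes from, here is a rough sketch of closeness centrality as the inverse of the average shortest-path distance. The five-node graph is made up, and this is only my reading of the definition, not Gephi's own code.

```python
from collections import deque

# Made-up undirected graph: A is a hub, E hangs off D.
graph = {
    "A": ["B", "C", "D"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A", "E"],
    "E": ["D"],
}

def bfs_distances(graph, start):
    """Shortest-path distance (in hops) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def closeness(graph, node):
    """Inverse of the average distance from node to the nodes it can reach."""
    dist = bfs_distances(graph, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

scores = {node: closeness(graph, node) for node in graph}
```

The hub A reaches the four other nodes in 5 hops in total, so its closeness is 4 / 5 = 0.8, while the outlying node E only gets 4 / 9.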

  • Betweenness Centrality

screen-shot-2016-10-01-at-9-47-14-pm

This graph is based on betweenness centrality. This measure shows how much a node acts as a bottleneck: a node is important if it becomes a communication bottleneck. As an analogy, think of a road intersection as a node. The more routes that must pass through the intersection (i.e. there is no alternative way), the more important that intersection becomes. If the traffic lights at the intersection go off, the result can be fatal because the flow of cars (information) is hampered. Betweenness can also be used to identify boundary spanners, the people or nodes that serve as a connection (bridge) between two communities. The betweenness centrality of a node is calculated by counting the shortest paths between other pairs of nodes that pass through that node. Nodes around the edge of the network typically have a low betweenness centrality (node 1545, pink). A high betweenness centrality (node 1551, black) suggests that the individual connects various different parts of the network together. Node 1551 is the intermediary or broker node that connects different clusters.
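The bridge idea can be tried out on a toy example. The sketch below uses Brandes' algorithm, a standard way to count shortest paths through each node; I am not claiming it is exactly what Gephi runs. The graph (two triangles joined by the edge C-D) is made up, and the scores are left unnormalized.

```python
from collections import deque

# Made-up graph: triangle A-B-C and triangle D-E-F joined by the bridge C-D.
graph = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"],
    "D": ["C", "E", "F"], "E": ["D", "F"], "F": ["D", "E"],
}

def betweenness(graph):
    """Unnormalized betweenness centrality for an undirected, unweighted graph."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: total / 2 for v, total in score.items()}

scores = betweenness(graph)
```

The two bridge nodes C and D each sit on the shortest path of 6 pairs, while the nodes at the edge of the graph score 0, which mirrors the broker versus periphery contrast described above.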

  • Modularity Class

screen-shot-2016-10-01-at-9-25-30-pm

screen-shot-2016-10-01-at-9-52-37-pm

This graph shows the modularity classes, i.e. the communities of the nodes. The results show that the biggest community is class 2, which contains 29.24% of the nodes.
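Gephi finds the communities itself, and the sketch below does not try to reproduce that search. It only shows how the modularity score of an already given partition can be computed with the standard formula, on a made-up graph with a hand-picked split into two communities.

```python
# Made-up graph: two triangles joined by one edge.
edges = [("A", "B"), ("A", "C"), ("B", "C"),
         ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]

# Hand-picked partition; the class numbers are arbitrary labels.
community = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}

def modularity(edges, community):
    """Q = sum over communities of (internal_edges / m - (degree_sum / 2m)^2)."""
    m = len(edges)
    internal = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

q_split = modularity(edges, community)
q_single = modularity(edges, {node: 0 for node in community})
```

Splitting the toy graph into its two triangles gives Q = 5/14 (about 0.357), while putting every node in one community gives Q = 0, so a higher score means a clearer division into groups.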

  • Eccentricity

 

This graph shows the eccentricity of each node, which captures the distance between a node and the node that is furthest from it; so a high eccentricity (node 1545) means that the furthest away node in the network is a long way away, and a low eccentricity means that the furthest away node is actually quite close. In other words, node 1545 is many steps away from the most distant nodes in the network.
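Eccentricity is simple enough to compute by hand on a small example. The sketch below runs a breadth-first search on a made-up path graph and takes the largest distance found; it assumes the graph is connected.

```python
from collections import deque

# Made-up path graph: A - B - C - D.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

def eccentricity(graph, start):
    """Largest shortest-path distance from start (graph assumed connected)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return max(dist.values())

ecc = {node: eccentricity(graph, node) for node in graph}
```

The end nodes A and D have eccentricity 3, while the middle nodes B and C have eccentricity 2, so nodes on the outskirts get the higher values.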

  • Clustering Coefficient

 

screen-shot-2016-10-01-at-10-04-19-pm

 

This graph shows the clustering coefficient which, when applied to a single node, is a measure of how complete the neighborhood of a node is. The highest clustering coefficient in the network belongs to node 1545. Its neighbours are more densely connected to one another than the neighbours of any other node.
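As a rough illustration of "how complete the neighborhood is", the sketch below counts the links that exist between a node's neighbours (each such link closes a triangle) and divides by the number of links that could exist. The four-node graph is made up.

```python
from itertools import combinations

# Made-up undirected graph: a triangle A-B-C with D attached to C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

def triangles(graph, node):
    """Number of connected neighbour pairs, i.e. triangles through node."""
    return sum(1 for u, v in combinations(sorted(graph[node]), 2)
               if v in graph[u])

def clustering(graph, node):
    """Share of neighbour pairs that are connected to each other."""
    k = len(graph[node])
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair to connect
    return triangles(graph, node) / (k * (k - 1) / 2)

coefficients = {node: clustering(graph, node) for node in graph}
```

Node A scores 1.0 because its two neighbours know each other, while C scores only 1/3 even though it has more edges. This is why a node with few but tightly knit neighbours can top the clustering ranking without being the most connected node.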

 

  •  Eigenvector centrality

 

eigenvector

 

The graph shows that node 1761 has the highest eigenvector centrality, 1.0, meaning it is the node best connected to other well-connected nodes.
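A minimal sketch of the idea, assuming a simple power iteration: every node repeatedly takes the sum of its neighbours' scores, and the scores are rescaled so that the top node gets 1.0 (which would be consistent with the 1.0 reported above). The graph is made up and the iteration count is an arbitrary choice.

```python
# Made-up undirected graph: a triangle A-B-C with D attached to C.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

def eigenvector_centrality(graph, iterations=100):
    """Power iteration, rescaled each round so the largest score is 1.0."""
    score = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # A node's new score is the sum of its neighbours' current scores.
        new_score = {node: sum(score[nb] for nb in graph[node])
                     for node in graph}
        largest = max(new_score.values())
        score = {node: value / largest for node, value in new_score.items()}
    return score

centrality = eigenvector_centrality(graph)
```

In this toy graph C ends up with 1.0, and D scores lowest because its only neighbour is C, so a node's score depends on who it is connected to and not only on how many edges it has.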

  • Number Of Triangle

screen-shot-2016-10-01-at-10-11-17-pm

Node 1761 has the largest number of triangles, 1616 in total.

*For most of the graphs, I used the Force-Atlas 2 layout to disperse the network, so we can see the graph more clearly.

 

Conclusion :

 

Gephi has been a useful tool for visualizing data. From this case we know that node 1551, a male student in class 3B, has the most edges in the network, with a degree of 98. He also has the shortest average distance to the others (the highest closeness centrality).

screen-shot-2016-10-01-at-10-18-46-pm

 

References:

 

https://gephi.org/tutorials/gephi-tutorial-visualization.pdf

http://www.martingrandjean.ch/introduction-to-network-visualization-gephi/

https://blog.ouseful.info/2010/05/10/getting-started-with-gephi-network-visualisation-app-–-my-facebook-network-part-iii-ego-filters-and-simple-network-stats/

http://www.slideshare.net/gephi/gephi-quick-start

http://www.sociopatterns.org

https://en.wikipedia.org/wiki/Analysis

 

 
