StORMeD: Stack Overflow Ready Made Data

The contents of a Stack Overflow discussion are inherently heterogeneous, mixing natural language, source code (e.g., Java), stack traces, and configuration files in XML or JSON format. Modeling such heterogeneous contents is not a trivial task and requires a full-fledged heterogeneous island grammar.

Bored of building your own island parser? Cannot afford enough computational power to perform your analyses on Stack Overflow? No worries, these times are over.

StORMeD is created from a subset of the Stack Exchange Data Dump available at The Internet Archive, and it is updated to June 2017.
Would you like to use our dataset and our development kit? You are welcome to cite us!

@inproceedings{Ponz2015a,
  Author = {Luca Ponzanelli and Andrea Mocci and Michele Lanza},
  Title = {StORMeD: Stack Overflow Ready Made Data},
  Booktitle = {Proceedings of MSR 2015 (12th Working Conference on Mining Software Repositories)},
  Pages = {474-477},
  Publisher = {ACM Press},
  Year = {2015}
}

Do you want to know more? Take a look at our paper!
Luca Ponzanelli, Andrea Mocci, and Michele Lanza: StORMeD: Stack Overflow Ready Made Data

StORMeD is a dataset that models the heterogeneous contents of each discussion tagged with the <java> tag. For each discussion in the dataset, StORMeD provides a Heterogeneous Abstract Syntax Tree (H-AST) that lets you navigate XML, JSON, stack traces, and Java code together with natural language. StORMeD is available in JSON format.

StORMeD is not backward compatible, for either the dataset or the service. If your project uses an older version (i.e., < 2.0.0), you need to upgrade your devkit; otherwise unexpected serialization failures might happen.

JSON Dataset Updated to June 2017

StORMeD also provides a Scala API to analyze your data without necessarily having to import the dataset into a database. The API provides all the H-AST nodes used to generate the JSON files in the dataset. A Scaladoc is available that describes all the case classes used in the model.

The StORMeD DevKit can be used in any Maven project by adding the following repository and dependency to your pom file.

<repository>
	<id>stormed</id>
	<name>StORMeD Dev-Kit Repository</name>
	<url>https://anonymous:anonymous@stormed.inf.usi.ch/releases/</url>
</repository>

<dependency>
	<groupId>ch.usi.inf.reveal.parsing</groupId>
	<artifactId>stormed-devkit_2.12</artifactId>
	<version>2.0.0</version>
</dependency>

To use the StORMeD DevKit in any SBT project, just add the following to your build.sbt file:

resolvers += "StORMeD Dev-Kit Repository" at "https://stormed.inf.usi.ch/releases/"
credentials += Credentials("Sonatype Nexus Repository Manager", "stormed.inf.usi.ch", "anonymous", "anonymous")
libraryDependencies += "ch.usi.inf.reveal.parsing" %% "stormed-devkit" % "2.0.0"

Deserializing a JSON file

import ch.usi.inf.reveal.parsing.artifact._
val jsonFilePath = /*path to json file*/
val artifact = ArtifactSerializer.deserializeFromFile(jsonFilePath)

StORMeD implements a utility object, ArtifactSerializer, that deserializes a JSON file or a JSON String into a StackOverflowArtifact. With these three simple lines, you can already start playing with Stack Overflow objects.

Visiting the H-AST with the new API

The StORMeD development kit lets you visit the H-AST programmatically. By default, several visitors that perform node collection are available in the development kit. In detail, it is possible to use IdentifierNodeVisitor, TypeNodeVisitor, MethodInvocationNodeVisitor, MethodDeclaratorNodeVisitor, ImportDeclarationNodeVisitor, VariableDeclaratorVisitor, ClassDeclaratorNodeVisitor, InterfaceDeclaratorNodeVisitor, and EnumDeclaratorNodeVisitor. These visitors visit an H-AST and collect nodes from it, optionally filtering what to collect. The usage is straightforward. An example is reported below:

//Instantiates a visitor collecting nodes in a List
val listVisitor = IdentifierNodeVisitor.list()
//a list of IdentifierNode collected by visiting the artifact
val collected = listVisitor(List(), artifact)

In the example above, listVisitor collects nodes without applying any filtering, starting from an empty collector (List()). The built-in visitors also accept a filtering function to exclude, for example, every IdentifierNode defined within methods. This can be handled by defining a simple filtering function that excludes method declarators.

def excludeMethodDeclarators(element: Visitable) = element match{
  case node:MethodDeclaratorNode => false
  case _ => true
}

//Instantiates a visitor collecting nodes in a List
val listVisitor = IdentifierNodeVisitor.list(excludeMethodDeclarators) 
//a list of IdentifierNode collected outside methods
val collected = listVisitor(List(), artifact)

The method excludeMethodDeclarators filters out all MethodDeclaratorNode elements, thus excluding them from the visit. The filtering function can be passed to the visitor at instantiation time as a parameter of the method list, or of the other instantiation methods: set, to collect only distinct elements, or seq, to collect elements in a generic collection. Every filtering function, like excludeMethodDeclarators, is of type Visitable => Boolean.
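
For readers who work directly on the JSON files rather than through the Scala API, the same collect-with-filter idea can be sketched in Python. This is only an illustration under stated assumptions: it treats an H-AST as nested dictionaries and lists, and assumes each node carries a `type` field (the Python service example on this page reads `node.type`). The helper names `collect_nodes` and `exclude_method_declarators`, and the tiny sample tree, are hypothetical and not part of the DevKit.

```python
def collect_nodes(tree, wanted_type, keep=lambda node: True):
    """Collect every dict whose 'type' equals wanted_type, depth first.

    `keep` plays the role of the Visitable => Boolean filter: when it
    returns False for a node, that node and its whole subtree are skipped.
    """
    collected = []

    def visit(element):
        if isinstance(element, dict):
            if not keep(element):
                return  # prune this node and everything below it
            if element.get("type") == wanted_type:
                collected.append(element)
            for child in element.values():
                visit(child)
        elif isinstance(element, list):
            for child in element:
                visit(child)

    visit(tree)
    return collected


def exclude_method_declarators(node):
    # Mirrors excludeMethodDeclarators: reject method declarators.
    return node.get("type") != "MethodDeclaratorNode"


# Hypothetical miniature tree; real StORMeD JSON is richer than this.
sample = {
    "type": "ClassDeclaratorNode",
    "identifier": {"type": "IdentifierNode", "name": "Foo"},
    "members": [
        {
            "type": "MethodDeclaratorNode",
            "identifier": {"type": "IdentifierNode", "name": "bar"},
        },
        {"type": "IdentifierNode", "name": "baz"},
    ],
}

all_names = [n["name"] for n in collect_nodes(sample, "IdentifierNode")]
outside_methods = [
    n["name"]
    for n in collect_nodes(sample, "IdentifierNode", exclude_method_declarators)
]
```

In this sketch the filter prunes whole subtrees, which is what makes it possible to drop identifiers declared inside methods; whether the DevKit visitors prune in exactly the same way should be checked against the Scaladoc.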

Devising Custom Collectors

The built-in visitors do not cover all the types of H-AST nodes provided in StORMeD. One may need to collect other types of nodes, for example, comments. The development kit makes it easy to extend the set of built-in collectors by extending the NodeAccumulatorVisitor in the following way:

object Extractors {
  def commentExtractor = ((element: Visitable) => {
      element match {
        case comment: CommentNode =>  Seq(comment)
      }
  })        
}

object CommentNodeVisitor extends NodeAccumulatorVisitor[CommentNode](Extractors.commentExtractor)

The method commentExtractor implements a partial function that just selects the node of interest, that is, the CommentNode. Once devised, the CommentNodeVisitor can be used as previously explained for the IdentifierNodeVisitor.

Too lazy to read an entire Stack Overflow discussion? No problem, there is a way to make it shorter.

About the Summarizer

Every discussion in the StORMeD dataset can be viewed and shrunk. The summarizer is interactive: the user chooses the percentage of the discussion to show by using the slider in the top-right corner of the discussion.

The paper below provides details about the approach, where island parsing and classical textual summarization approaches are mixed to deal with code elements.

@inproceedings{Ponz2015b,
  Author = {Luca Ponzanelli and Andrea Mocci and Michele Lanza},
  Title = {Summarizing Complex Development Artifacts by Mining Heterogeneous Data},
  Booktitle = {Proceedings of MSR 2015 (12th Working Conference on Mining Software Repositories)},
  Pages = {401-405},
  Publisher = {ACM Press},
  Year = {2015}
}

Do you need the island parsing technology behind StORMeD?

Now you have the chance to try it out. For Free.

Service Registration

To use the service, a simple registration with a valid email address is needed. You can register here to obtain a valid key. Every registered user is assigned a daily quota of 1000 requests. The quota is renewed at midnight. The registration phase is needed to identify users and avoid abuse. The information gathered during registration is considered confidential and will not be disclosed to third parties.

Only one key per email address can be released. Please remember that the key is personal.

Key Recovery

Did you forget or lose your key? No worries, you can recover it. All you need to do is access the Recovery Page and follow the instructions.

Service Methods

In the current version, the StORMeD API provides two service methods, parse and tagger. Every method requires a valid service key and a JSON document with the correct parameters, sent as a POST request.

The table below contains all the methods provided by the service, with the input parameters and the service output.

Method	Parameters	Output
/parse	`text`: a string with the text to be parsed by the service. `key`: a string with a valid service key for a registered user.	`status`: `OK` or `ERROR`, always present. If `status` is `ERROR`: `message`, the error message. If `status` is `OK`: `quotaRemaining`, the request quota left, and `result`, a list of HASTNode objects resulting from the parsing.
/tagger	`text`: a string with the text to be tagged by the service. `tagged`: if set to false, `text` is considered plain text; if set to true, `text` is considered tagged input, where existing `<code>` tags are preserved and untagged code within other tags is wrapped in additional `<code>` tags. `key`: a string with a valid service key for a registered user.	`status`: `OK` or `ERROR`, always present. If `status` is `ERROR`: `message`, the error message. If `status` is `OK`: `quotaRemaining`, the request quota left, and `result`, a String containing the contents sent with the request, with code elements wrapped in `<code>` tags.
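
The request and response shapes in the table can be exercised without touching the network. The Python sketch below is a hedged illustration, not the official client: the function names are hypothetical, `tagged` is assumed to be a JSON boolean that defaults to false, and the two sample response bodies (including the error text) are made up to match the field names listed in the table.

```python
import json


def build_parse_request(text, key):
    # /parse expects the text to parse and a valid service key.
    return json.dumps({"text": text, "key": key})


def build_tagger_request(text, key, tagged=False):
    # /tagger additionally takes the `tagged` flag described in the table.
    return json.dumps({"text": text, "tagged": tagged, "key": key})


def interpret_response(body):
    """Turn a JSON response body into (result, quota_remaining) or raise."""
    response = json.loads(body)
    if response["status"] == "ERROR":
        # On ERROR only `message` is documented, so surface it to the caller.
        raise RuntimeError(response["message"])
    return response["result"], response["quotaRemaining"]


# Hypothetical response bodies, shaped after the table above.
ok_body = '{"status": "OK", "quotaRemaining": 999, "result": []}'
error_body = '{"status": "ERROR", "message": "invalid key"}'

result, quota = interpret_response(ok_body)
try:
    interpret_response(error_body)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

Checking `status` before reading any other field matters because `quotaRemaining` and `result` are only documented for `OK` responses.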

Using the service in Java and Scala

Like any other REST service, the StORMeD service can be used programmatically. The Scala and Java examples below rely on the StORMeD DevKit, which is added to a Maven or SBT project with the same repository and dependency settings listed earlier for the dataset API.

Once your project is set up, interacting with the service API is fairly simple. As a proof of concept, a StORMeD Client is available for download as both a Maven and an SBT project. The client includes a simple Scala wrapper for the current service methods: the object StormedService shows how to implement such a wrapper, and its full code is available in the project.

The project also contains a sample main program that invokes the StormedService wrapper. With a few lines of Scala it is possible to perform a parsing request.

object StormedClientExample extends App {
 		
  val codeToParse = """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit
    public static void main(int args[])
    Proin tincidunt tristique ante, sed lacinia leo fermentum quis.
    Fusce in magna eu ante tincidunt euismod nec eu ligula.
    List<Integer> someList;
    """.trim

  
  val result = StormedService.parse(codeToParse)
  result match {
    case ParsingResponse(result, quota, status) =>
      println(s"Status: $status")
      println(s"Quota Remaining: $quota")
      val nodeTypes = result.map{_.getClass.getSimpleName}
      println("Parsing Result: ")
      nodeTypes.foreach{println}
    case ErrorResponse(message, status) =>
      println(status + ": " + message)
  }
}

The object StormedService can also be invoked from Java with little effort. The only aspect to take into account concerns Scala case classes. This sample code shows how to deal with Scala classes from Java and how some of them can be easily converted and used with Java.

public class StormedClientJavaExample {

  public static void main(final String args[]) {
    final String codeToParse =
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit\n"+
        "public static void main(int args[])\n"+
        "Proin tincidunt tristique ante, sed lacinia leo fermentum quis.\n"+
        "Fusce in magna eu ante tincidunt euismod nec eu ligula.\n"+
        "List<Integer> someList;";


    final Response response = StormedService.parse(codeToParse);
    
    /*Whenever accessing fields of Scala case classes, 
     *getters are always in the form variableName.fieldName();
     */
    switch(response.status()) {
    case "OK":
      final ParsingResponse success = (ParsingResponse) response;
      System.out.println("Status: " + success.status());
      System.out.println("Quota Remaining: " + success.quotaRemaining());
      System.out.println("Parsing Result: ");
      printHASTNodes(success.result());
      break;
    case "ERROR":
      final ErrorResponse error = (ErrorResponse) response;
      System.out.println(error.status() +": " + error.message());
      break;
    }

  }
  
  /* Any Scala collection carried by an HASTNode can be converted to
   * a Java collection using the Scala SDK's converters for Java.
   * For convenience, statically import asJavaCollection (imports are
   * omitted in this excerpt) and use it to convert a Seq to a Collection.
   */
  private static void printHASTNodes(final Seq<HASTNode> result) {
    final Collection<HASTNode> collection = asJavaCollection(result);
    for(final HASTNode node : collection){
      System.out.println(node.getClass().getSimpleName());
    }
  }
}

Using the service in Python

The service can be used programmatically from any language that supports JSON deserialization and REST requests. As a proof of concept, Python can be used to replicate the same service usage shown in Scala and Java.

Requirements

Python can be used without the StORMeD Development Kit, but it requires one external dependency: requests

Install requests from your terminal with either easy_install ($ easy_install requests) or pip ($ pip install requests).

Deserializing JSON

There is no development kit for Python. However, any object returned by the service can be easily deserialized and used in Python or any prototype-based language. First of all, the packages requests and json must be imported, and an ObjectWrapper class must be defined to wrap every single JSON object that will be deserialized.

import json
import requests

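class ObjectWrapper(object):
  def __init__(self, obj):
    # obj is the dictionary produced by the json library
    self.obj = obj

  def __getattr__(self, name):
    # expose dictionary keys as attributes, e.g. wrapper.status
    return self.obj[name]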


def object_decoder(obj):
  if isinstance(obj,dict):
    return ObjectWrapper(obj)
  else:
    return obj

Every JSON object in Python is converted to a dictionary. The easiest way to convert these dictionaries to actual objects is to use the object_decoder function, which wraps every deserialized JSON object (i.e., every dictionary) in an instance of ObjectWrapper and leaves any other value untouched. The object_decoder function is used as object_hook by the json library.
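
To see the effect of object_hook without calling the service, the decoding step can be tried on a hand-written response. The sketch below is illustrative only: it repeats the wrapper in compact form so that it runs on its own, and the response body, including the node type names, is made up; only the field names `status`, `quotaRemaining`, `result` and `type` come from this page.

```python
import json


class ObjectWrapper(object):
    def __init__(self, obj):
        self.obj = obj

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so self.obj is safe.
        return self.obj[name]


def object_decoder(obj):
    return ObjectWrapper(obj) if isinstance(obj, dict) else obj


# Hypothetical response body, not actual service output.
body = """
{"status": "OK",
 "quotaRemaining": 42,
 "result": [{"type": "TextFragmentNode"}, {"type": "IdentifierNode"}]}
"""

# json calls object_hook on every decoded JSON object, innermost first,
# so the entries of `result` are already wrapped when the outer object is.
wrapper = json.loads(body, object_hook=object_decoder)
node_types = [node.type for node in wrapper.result]
```

Lists are left as plain Python lists, which is why `wrapper.result` can be iterated directly while each of its elements supports attribute access.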

data = {
  'text': "void main(final String[] args) <== this is the main method.",
  'key': "<your API Key>"
}


url = "https://stormed.inf.usi.ch/service/parse"
headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
response = requests.post(url, data=json.dumps(data), headers=headers, verify=False)
wrapper = json.loads(response.text, object_hook=object_decoder)
if wrapper.status == "OK":
  # print() calls work on both Python 2.7 and Python 3
  print("Status: " + wrapper.status)
  print("Quota Remaining: " + str(wrapper.quotaRemaining))
  print("Parsing Result: ")
  for node in wrapper.result:
    print(node.type)
else:
  print(wrapper.status + ": " + wrapper.message)

The code performs a POST request to the parse method of the service, and replicates what the Java and Scala samples do.
The full Python code is available here.
