Stack Overflow Ready Made Data
The contents of a Stack Overflow discussion are inherently heterogeneous, mixing natural language, source code (e.g., Java), stack traces, and configuration files in XML or JSON format. Modeling such heterogeneous contents is not a trivial task and requires a full-fledged heterogeneous island grammar.
Bored of building your own island parser? Cannot afford enough computational power to perform your analyses on Stack Overflow? No worries, these times are over.
StORMeD is created from a subset of the Stack Exchange Data Dump available at The Internet Archive, and it is current as of June 2017.
Would you like to use our dataset and our development kit?
You are welcome to cite us!
@inproceedings{Ponz2015a,
  Author = {Luca Ponzanelli and Andrea Mocci and Michele Lanza},
  Title = {StORMeD: Stack Overflow Ready Made Data},
  Booktitle = {Proceedings of MSR 2015 (12th Working Conference on Mining Software Repositories)},
  Pages = {474-477},
  Publisher = {ACM Press},
  Year = {2015}
}
Do you want to know more? Take a look at our paper!
Luca Ponzanelli, Andrea Mocci, and Michele Lanza: StORMeD: Stack Overflow Ready Made Data
StORMeD is a dataset that models the heterogeneous contents of each discussion tagged with the <java> tag. For each discussion in the dataset, StORMeD provides a Heterogeneous Abstract Syntax Tree (H-AST) that lets you navigate XML, JSON, stack traces, and Java code together with natural language.
StORMeD is available in the JSON format.
StORMeD is not backward compatible for either the dataset or the service. If your project is using an older version (e.g., < 2.0.0), you need to upgrade your devkit, or unexpected serialization failures might happen.
StORMeD also provides a Scala API to analyze your data without necessarily having to import the dataset into a database. The API provides all the H-AST nodes used to generate the JSON files in the dataset. A Scaladoc is available describing all the case classes used in the model.
The StORMeD DevKit can be used in any Maven project by adding the following repository and dependency to your pom file.
<repository>
  <id>stormed</id>
  <name>StORMeD Dev-Kit Repository</name>
  <url>https://anonymous:anonymous@stormed.inf.usi.ch/releases/</url>
</repository>

<dependency>
  <groupId>ch.usi.inf.reveal.parsing</groupId>
  <artifactId>stormed-devkit_2.12</artifactId>
  <version>2.0.0</version>
</dependency>
To use the StORMeD DevKit in any SBT project, just add the following to your build.sbt file:
resolvers += "StORMeD Dev-Kit Repository" at "https://stormed.inf.usi.ch/releases/"

credentials += Credentials("Sonatype Nexus Repository Manager", "stormed.inf.usi.ch", "anonymous", "anonymous")

libraryDependencies += "ch.usi.inf.reveal.parsing" %% "stormed-devkit" % "2.0.0"
import ch.usi.inf.reveal.parsing.artifact._
val jsonFilePath = /* path to JSON file */
val artifact = ArtifactSerializer.deserializeFromFile(jsonFilePath)
StORMeD implements a utility object, ArtifactSerializer, that deserializes a JSON file or a JSON String into a StackOverflowArtifact. With these three simple lines, you can already start playing with Stack Overflow objects.
The StORMeD development kit lets you programmatically visit the H-AST. By default, the development kit ships several visitors that perform node collection. In detail, it is possible to use IdentifierNodeVisitor, TypeNodeVisitor, MethodInvocationNodeVisitor, MethodDeclaratorNodeVisitor, ImportDeclarationNodeVisitor, VariableDeclaratorVisitor, ClassDeclaratorNodeVisitor, InterfaceDeclaratorNodeVisitor, and EnumDeclaratorNodeVisitor. These visitors visit an H-AST and collect nodes from it, optionally filtering what to collect. The usage is straightforward. An example is reported below:
// Instantiates a visitor collecting nodes in a List
val listVisitor = IdentifierNodeVisitor.list()
// A list of IdentifierNode collected by visiting the artifact
val collected = listVisitor(List(), artifact)
In the example above, listVisitor collects nodes without applying any filtering, beginning with an empty collector (List()). The built-in visitors also accept a filtering function to exclude, for example, every IdentifierNode defined within methods. This filtering can be handled by defining a simple filtering function that excludes method declarators.
def excludeMethodDeclarators(element: Visitable) = element match {
  case node: MethodDeclaratorNode => false
  case _ => true
}

// Instantiates a visitor collecting nodes in a List
val listVisitor = IdentifierNodeVisitor.list(excludeMethodDeclarators)
// A list of IdentifierNode collected outside methods
val collected = listVisitor(List(), artifact)
The method excludeMethodDeclarators filters out all MethodDeclaratorNode elements, thus excluding them from the visit. The filtering function can be passed to the visitor at instantiation time as a parameter of the method list, or of the other instantiation methods: set, to collect only distinct elements, or seq, to collect elements in a generic collection. Every filtering function, like excludeMethodDeclarators, is of type Visitable => Boolean.
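To make the filtering behaviour concrete, here is a minimal sketch of a filtered collecting visit, written in Python over plain dictionaries. It is not the devkit's implementation: the node shape (a "type" field plus an optional "children" list), the collect function, and the assumption that a rejected node prunes its whole subtree are all hypothetical and only illustrate the idea described above.

```python
from typing import Any, Callable, Dict, List

# Hypothetical node shape (an assumption, not the StORMeD JSON schema):
# {"type": "<node type>", "children": [<node>, ...], ...}
Node = Dict[str, Any]


def collect(node: Node, wanted_type: str,
            keep: Callable[[Node], bool] = lambda n: True) -> List[Node]:
    """Collect nodes of wanted_type, skipping every subtree whose root fails keep()."""
    if not keep(node):
        return []  # the rejected node and everything below it are not visited
    found = [node] if node["type"] == wanted_type else []
    for child in node.get("children", []):
        found.extend(collect(child, wanted_type, keep))
    return found


def exclude_method_declarators(node: Node) -> bool:
    """Filter in the spirit of excludeMethodDeclarators: reject method declarators."""
    return node["type"] != "MethodDeclaratorNode"


# A tiny hand-made tree: one identifier outside a method, one inside.
tree = {
    "type": "ClassDeclaratorNode",
    "children": [
        {"type": "IdentifierNode", "name": "Foo"},
        {"type": "MethodDeclaratorNode",
         "children": [{"type": "IdentifierNode", "name": "bar"}]},
    ],
}

all_ids = [n["name"] for n in collect(tree, "IdentifierNode")]
outside_methods = [n["name"]
                   for n in collect(tree, "IdentifierNode", exclude_method_declarators)]
print(all_ids)
print(outside_methods)
```

Without a filter both identifiers are collected; with the filter only the one outside the method declarator remains.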
The built-in visitors do not cover all the types of H-AST nodes provided in StORMeD. One may need to collect other types of nodes, for example, comments. The development kit makes it easy to extend the set of built-in collectors by extending the NodeAccumulatorVisitor in the following way:
object Extractors {
  def commentExtractor = ((element: Visitable) => {
    element match {
      case comment: CommentNode => Seq(comment)
    }
  })
}

object CommentNodeVisitor extends NodeAccumulatorVisitor[CommentNode](Extractors.commentExtractor)
The method commentExtractor implements a partial function that just selects the node of interest, that is, the CommentNode. Once defined, the CommentNodeVisitor can be used as previously explained for the IdentifierNodeVisitor.
Too lazy to read an entire Stack Overflow discussion? No problem, there is a way to make it shorter.
Every discussion in the StORMeD dataset can be viewed and shrunk. The summarizer is interactive: the user chooses the percentage of the discussion to show by using the slider in the top-right corner of the discussion.
The paper below provides details about the approach, where island parsing and classical textual summarization approaches are mixed to deal with code elements.
@inproceedings{Ponz2015b,
  Author = {Luca Ponzanelli and Andrea Mocci and Michele Lanza},
  Title = {Summarizing Complex Development Artifacts by Mining Heterogeneous Data},
  Booktitle = {Proceedings of MSR 2015 (12th Working Conference on Mining Software Repositories)},
  Pages = {401-405},
  Publisher = {ACM Press},
  Year = {2015}
}
Do you need the island parsing technology behind StORMeD?
Now you have the chance to try it out. For Free.
To use the service, a simple registration with a valid email address is needed. You can register here to obtain a valid key. Every registered user is assigned a daily quota of 1000 requests. The quota is renewed at midnight. The registration phase is needed to identify users and avoid abuse. The information gathered during the registration is considered confidential and will not be disclosed to third parties.
Only one key per email address can be released. Please remember that the key is personal.
Did you forget or lose your key? No worries, you can recover it. All you need to do is access the Recovery Page and follow the instructions.
In the current version, the StORMeD API provides a single service, parse. Every method requires a valid service key and a JSON body with the correct parameters, sent as a POST request.
The table below lists all the methods provided by the service, with their input parameters and the service output.
Method | Parameters | Output |
---|---|---|
/parse | text: a string representing the text to be parsed by the service. key: a string representing a valid service key for a registered user. | status: OK or ERROR, always present. If status equals ERROR, message: the error message. If status equals OK, quotaRemaining: the request quota left; result: a list of HASTNode objects resulting from the parsing. |
/tagger | text: a string representing the text to be tagged by the service. tagged: if set to false, text is considered plain text; if set to true, text is considered a tagged input: <code> tags are preserved, while untagged code within other tags will be tagged with additional <code> tags. key: a string representing a valid service key for a registered user. | status: OK or ERROR, always present. If status equals ERROR, message: the error message. If status equals OK, quotaRemaining: the request quota left; result: a String containing the contents sent with the request, with code elements wrapped in <code> tags. |
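The request and response fields in the table can be handled with a few lines of standard-library Python. The sketch below is only an illustration under stated assumptions: the helper names build_request and interpret_response are hypothetical, tagged is assumed to be sent as a JSON boolean, and the canned response bodies (including the node type name) are made up to mirror the documented fields rather than captured from the real service.

```python
import json


def build_request(text, key, tagged=None):
    """Serialize the JSON body for a request (hypothetical helper).

    text and key are the parameters listed in the table; tagged is only
    added when given, and is assumed here to be a JSON boolean.
    """
    payload = {"text": text, "key": key}
    if tagged is not None:
        payload["tagged"] = tagged
    return json.dumps(payload)


def interpret_response(body):
    """Return (result, quotaRemaining) on OK; raise RuntimeError on ERROR."""
    reply = json.loads(body)
    if reply["status"] == "OK":
        return reply["result"], reply["quotaRemaining"]
    raise RuntimeError(reply.get("message", "unknown error"))


# Canned bodies that only mirror the documented fields (not real service output).
ok_body = '{"status": "OK", "quotaRemaining": 999, "result": [{"type": "SomeNode"}]}'
error_body = '{"status": "ERROR", "message": "invalid key"}'

request_body = build_request("List<Integer> someList;", "<your API Key>", tagged=False)
result, quota = interpret_response(ok_body)
print(request_body)
print(quota, result)

try:
    interpret_response(error_body)
except RuntimeError as err:
    print("ERROR: " + str(err))
```

Checking status before touching quotaRemaining or result matters because, according to the table, those fields are only present when status equals OK.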
Like any other REST service, the StORMeD service can be used programmatically.
Once your project is set up, interacting with the service API is fairly simple. As a proof of concept, a StORMeD Client is available to download as both a Maven and an SBT project.
The StORMeD Client example includes a simple Scala wrapper for the current service methods. The object StormedService shows how to easily implement a wrapper for the service. The full-fledged code is available in the project.
The project contains a sample main program that invokes the StormedService wrapper. With a few lines of Scala it is possible to perform a parsing request.
object StormedClientExample extends App {
  val codeToParse = """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit
    public static void main(int args[])
    Proin tincidunt tristique ante, sed lacinia leo fermentum quis.
    Fusce in magna eu ante tincidunt euismod nec eu ligula.
    List<Integer> someList;
  """.trim

  val result = StormedService.parse(codeToParse)
  result match {
    case ParsingResponse(result, quota, status) =>
      println(s"Status: $status")
      println(s"Quota Remaining: $quota")
      val nodeTypes = result.map{_.getClass.getSimpleName}
      println("Parsing Result: ")
      nodeTypes.foreach{println}
    case ErrorResponse(message, status) =>
      println(status + ": " + message)
  }
}
The object StormedService can easily be invoked from Java with little effort. The only aspect to take into account concerns Scala case classes. This sample code shows how to deal with Scala classes from Java and how some of them can be easily converted and used with Java.
// Standard Java and Scala library imports. StormedService, Response,
// ParsingResponse, ErrorResponse and HASTNode come from the StORMeD client
// project and devkit (their imports are omitted here).
import java.util.Collection;
import scala.collection.Seq;
import static scala.collection.JavaConverters.asJavaCollection;

public class StormedClientJavaExample {

  public static void main(final String args[]) {
    final String codeToParse =
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit\n" +
      "public static void main(int args[])\n" +
      "Proin tincidunt tristique ante, sed lacinia leo fermentum quis.\n" +
      "Fusce in magna eu ante tincidunt euismod nec eu ligula.\n" +
      "List<Integer> someList;";

    final Response response = StormedService.parse(codeToParse);

    /* Whenever accessing fields of Scala case classes,
     * getters are always in the form variableName.fieldName();
     */
    switch (response.status()) {
      case "OK":
        final ParsingResponse success = (ParsingResponse) response;
        System.out.println("Status: " + success.status());
        System.out.println("Quota Remaining: " + success.quotaRemaining());
        System.out.println("Parsing Result: ");
        printHASTNodes(success.result());
        break;
      case "ERROR":
        final ErrorResponse error = (ErrorResponse) response;
        System.out.println(error.status() + ": " + error.message());
        break;
    }
  }

  /* Any Scala collection carried by an HASTNode can be easily converted to
   * a Java collection by using the Scala SDK conversions for Java.
   * For convenience, just statically import asJavaCollection
   * as above and use it to convert a Seq to a Collection as below.
   */
  private static void printHASTNodes(final Seq<HASTNode> result) {
    final Collection<HASTNode> collection = asJavaCollection(result);
    for (final HASTNode node : collection) {
      System.out.println(node.getClass().getSimpleName());
    }
  }
}
The service can be used programmatically with any language that supports JSON deserialization and REST requests. As a proof of concept, Python can be used to replicate the same service usage performed in Scala and Java.
Python can easily be used without the StORMeD Development Kit, but it requires one external dependency: requests.
Install requests from your terminal via easy_install ($ easy_install requests) or pip ($ pip install requests).
The requests and json modules must be imported, and an ObjectWrapper class must be defined to wrap any single JSON object that will be deserialized.
import json
import requests

class ObjectWrapper(object):
    def __init__(self, obj):
        self.obj = obj

    def __getattr__(self, name):
        return self.obj[name]

def object_decoder(obj):
    if isinstance(obj, dict):
        return ObjectWrapper(obj)
    else:
        return obj
Every JSON object in Python is converted to a dictionary. The easiest way to convert these dictionaries to actual objects is to use the object_decoder function, which wraps every dictionary in an instance of ObjectWrapper and leaves any other value (e.g., a list) untouched. The object_decoder function is passed as object_hook to the json library.
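The effect of the object_hook can be checked offline on a hand-written JSON string. The sketch below repeats the wrapper so that it is self-contained; the sample string is made up to resemble a successful response and is not real service output.

```python
import json


class ObjectWrapper(object):
    """Expose the keys of a decoded JSON object as attributes."""

    def __init__(self, obj):
        self.obj = obj

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so self.obj is safe here.
        return self.obj[name]


def object_decoder(obj):
    return ObjectWrapper(obj) if isinstance(obj, dict) else obj


# Made-up sample resembling a successful response.
sample = '{"status": "OK", "quotaRemaining": 999, "result": [{"type": "A"}, {"type": "B"}]}'

# json calls object_hook for every decoded JSON object, innermost first,
# so both the top-level object and the objects inside the list get wrapped.
wrapper = json.loads(sample, object_hook=object_decoder)
node_types = [node.type for node in wrapper.result]

print(wrapper.status)
print(wrapper.quotaRemaining)
print(node_types)
```

Note that the list under result stays a regular Python list; only its elements are wrapped. Accessing a field that is not in the JSON raises KeyError rather than AttributeError, because the wrapper simply indexes the underlying dictionary.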
data = {
    'text': "void main(final String[] args) <== this is the main method.",
    'key': "<your API Key>"
}
url = "https://stormed.inf.usi.ch/service/parse"
headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}

# Note: verify=False disables TLS certificate verification.
response = requests.post(url, data=json.dumps(data), headers=headers, verify=False)
wrapper = json.loads(response.text, object_hook=object_decoder)

if wrapper.status == "OK":
    print("Status: " + wrapper.status)
    print("Quota Remaining: " + str(wrapper.quotaRemaining))
    print("Parsing Result: ")
    for node in wrapper.result:
        print(node.type)
else:
    print(wrapper.status + ": " + wrapper.message)
The code performs a POST request to the parse method of the service, and replicates what the Java and Scala samples do.
The full Python code is available here.