Pointer Analysis Homework


Pointer Analysis Homework

 

Introduction

 

Pointer analysis (also called points-to analysis or may-alias analysis or just alias analysis) is a static program analysis that essentially answers the question: Given a pointer expression e in a program, what is the set of possible memory locations to which e may point in executions of the program?

 

Pointer analysis lies at the heart of many program verification and optimization problems. The literature on pointer analysis in the last three decades is extensive.  Despite the large body of work on pointer analysis, however, newer pointer analyses are sought to solve increasingly harder verification and optimization problems, driven by increasingly powerful machines and increasingly complex programs.  Thus, if you need to design a static program analysis for such a problem, it is quite likely that you will have to design your own pointer analysis as well.

 

The goal of this howework is to implement a popular kind of pointer analysis for Java called context-insensitive pointer analysis with on-the-fly call graph construction.  You will use the Chord static analysis framework for Java to implement this analysis.

 

Requirements

 

Ensure that the following software is installed on your machine:

 

 

 

See here for the platforms on which Chord is supported.  It has been tested primarily on Linux though it should also work fine on Windows/Cygwin and Mac OS X.

 

Installation

 

Download Chord from here and uncompress it in a suitable directory by running the following command:

 

    tar xvzf chord.tar.gz

 

This will create a directory structure with root directory chord/.  This directory contains a file named build.xml which is interpreted by Apache Ant.  To see the various targets available run the following command in the root directory:

 

    ant -p

 

The default target compile compiles Chord.  So simply run the following command in the root directory:

 

    ant

 

This will compile the Chord .java source files under chord/main/src/ and chord/asgn/src/, and store the .class files under chord/main/classes/ and chord/asgn/classes, respectively.

 

Description

 

This homework consists of two steps.  The first step involves writing Java code to extract program facts needed by the pointer analysis.  The second step involves writing Datalog rules to express the pointer analysis itself.  The Datalog rules take as input the program facts obtained from Step 1 and produce as output points-to information and the call graph.  These two steps are explained in more detail below.

 

Step 1

 

This step involves writing Java code that extracts program facts needed by the pointer analysis.  The program facts are represented as finite program relations over finite program domains.

 

A program domain is a collection of values of related kind in a given Java program (e.g., methods, types, statements of a certain kind, etc.).  A program domain maps each value contained in it to a unique integer in the range [0..N-1], where N is the total number of values, in the order in which the values are added to the domain.  Directory chord/asgn/src/chord/doms/ contains classes of the form Dom*.java, each defining a different program domain required by the pointer analysis.  The fill() method of each such class expresses how to populate the corresponding program domain.

 

A program relation is a collection of tuples of related kind over one or more program domains.  Each program relation is stored compactly on disk as a BDD (Binary Decision Diagram), which motivates the need to encode values in program domains as integers.  Directory chord/asgn/src/chord/rels/ contains classes of the form Rel*.java, each defining a different program relation required by the pointer analysis.  The fill() method of each such class expresses how to populate the corresponding program relation.  You need to implement this method for each of the following classes:

 

-          RelClinitM

-          RelMmethArg

-          RelMmethRet

-          RelIinvkArg

-          RelIinvkRet

-          RelMobjValAsgnInst

-          RelMobjVarAsgnInst

-          RelMgetInstFldInst

-          RelMgetStatFldInst

-          RelMputInstFldInst

-          RelMputStatFldInst

-          RelMI

-          RelSpecIM

-          RelStatIM

-          RelVirtIM

 

Use the following resources to guide your implementation:

 

  1. Each of the above classes contains a comment describing the program relation it defines.  For instance, RelMobjValAsgnInst.java has the following comment:

    /**

      * Relation containing each tuple (m,v,h) such that method m contains

      * object allocation statement h which assigns to local variable v.

      */

 

  1. Each of the above classes contains an annotation of the form:

    @Chord(

        name = "MobjValAsgnInst",

        sign = "M0,V0,H0:M0_V0_H0"

    )

    This annotation states that the program relation defined by this class is named MobjValAsgnInst, it has three program domains named M, V, and H in order (the zeros in M0, V0, and H0 are used for disambiguation in case there are multiple domains of the same kind), and the BDD variable ordering to be used while storing this relation as a BDD is M0_V0_H0.   You do not need to worry about BDD variable orderings; intuitively, they determine how compactly the relations are stored on disk as BDDs.  You can look at DomM.java, DomV.java, and DomH.java in directory chord/asgn/src/chord/doms/ to see what the program domains named M, V, and H mean.

 

  1. Read the already implemented fill() methods of the remaining classes in directory chord/asgn/src/chord/rels/.

     

  2. Read this Javadoc API to figure how to extract the program facts you want from a given Java program.  As you may have noticed by inspecting some of the already implemented fill() methods, you can get a handle on the Java program by the statement:

 

Program program = project.getProgram();

 

To debug your implementation, you may want to print the intermediate representation of the Java program you are analyzing by running the command:

 

ant –Dchord.work.dir=<dir> -Dchord.print.program=true run

 

where <dir> is the "working directory" of the Java program.  There are three example Java programs provided under directory chord/asgn/tests/, namely, elevator, philo, and tsp; the working directory for each of these programs is the sub-directory named chord.  A file named log.txt is generated in the working directory after each run of Chord that contains useful information logged during the run.  The above command prints the intermediate representation of the Java program to log.txt.  Remember to rename this file (to say program.txt) else it will be over-written in future runs.

 

After you have implemented the fill() method of a particular class, you may want to dump the program relation defined by that class for an example Java program by running the command:

  

ant –Dchord.work.dir=<dir> -Dchord.analysis.list=<name> run

 

where <name> is the name of the program relation you want to print, e.g., MobjValAsgnInst.  Note that the name of a program relation comes from the name field of the @Chord annotation of the class defining that relation.  The above command will dump the relation as a BDD in a file in the working directory, e.g., MobjValAsgnInst.bdd.  To see the tuples in the relation, you can insert a System.out.println(...) call in the fill() method to print each tuple you add to the relation.  Again, remember that the output will go to file log.txt in the working directory.  A neater way to inspect the tuples in a relation stored as a BDD is described in Step 2 below.

 

Step 2

 

This step involves writing Datalog rules to express the pointer analysis.  You must write these rules in file chord/asgn/dlog/pa.dlog.  This file specifies an incomplete Datalog program that takes as input a bunch of program relations (including those whose fill() methods you implemented in Step 1) and produces as output the following four program relations representing points-to information and the call graph:

 

  1. Relation named VH denoting points-to information for local variables.  It contains each tuple (v,h) such that local variable v may point to an object allocated at site h in some execution of the program.
  2. Relation named FH denoting points-to information for global variables.  It contains each tuple (f,h) such that static field f may point to an object allocated at site h in some execution of the program.
  3. Relation named HFH denoting the heap abstraction.  It contains each tuple (h1,f,h2) such that instance field f of an object allocated at site h1 may point to an object allocated at site h2 in some execution of the program.
  4. Relation named IM denoting the call graph.  It contains each tuple (i,m) such that call site i may call target method m in some execution of the program.

 

The pointer analysis to be implemented is informally described below.  In pointer analysis parlance, the analysis is context-insensitive, flow-insensitive, object-insensitive, field-sensitive, inclusion-based (Andersen-style) as opposed to unification-based (Steensgaard-style), and performs on-the-fly (as opposed to ahead-of-time) call-graph construction. 

 

The analysis begins by regarding the main method (which for convenience always has index 0 in domain M) and each class initializer method reachable.  Whenever a method is deemed reachable, the analysis regards all statements in the body of that method (e.g. object allocation sites, call sites, etc.) reachable (recall that the analysis is flow-insensitive).  For each reachable statement, the analysis models the following “transfer function” depending upon the kind of the statement:

 

  1. If the statement is of the form v = new h, then v may point to objects allocated at site h.
  2. If the statement is of the form v1 = v2, then v1 may point to objects pointed to by v2.  The static type of v1 is used to filter objects whose run-time types are not compatible (hint: use relation VHfilter).
  3. If the statement is of the form v1 = v2.f, then v1 may point to objects pointed to by field f of objects pointed to be v1.
  4. If the statement is of the form v1.f = v2 where f is an instance field, then the heap abstraction records the fact that field f of objects pointed to by v1 may point to objects pointed to by v2.
  5. If the statement is of the form f = v where f is a static field, then f may point to objects pointed to by v.
  6. If the statement is of the form v = f where f is a static field, then v may point to objects pointed to by f.
  7. If the statement is a call site of the form v_r1,…,v_rn = m(v_a1, …, v_an) where v_r1, ..., v_rn are return variables and v_a1, ..., v_an are actual arguments, then:
    1. Depending upon the kind of the call, the possible target method(s) must be determined and deemed reachable.  There are three kinds of calls: invoke virtual/interface, invoke special, and invoke static.  The former is dynamically dispatched while the latter two are statically dispatched.  In the case of dynamic dispatching, there may be multiple target methods depending upon the run-time type of the receiver object, namely, for each possible object to which the first argument v_a1 may point, the unique target method must be deemed reachable (hint: use relation cha).  In the case of static dispatching, there is always a single target method, namely, the statically resolved method for the call site.
    2. The flow of data from v_a1, …, v_an to formal arguments of the method, and from return variables of the method to v_r1, …, v_rn must be properly captured.  In both cases, static types of variables should be used to filter objects whose run-time types are not compatible (see item 2 above).

 

For the syntax of Datalog programs, see the two complete Datalog programs in files cha.dlog and VHfilter.dlog in directory chord/asgn/dlog/.

 

To test your implementation, run the following command:

 

ant –Dchord.work.dir=<dir> pa

 

This command will perform both Step 1 and Step 2.  Upon successful completion, it will produce files VH.txt, FH.txt, HFH.txt, and IM.txt in the working directory.  If it does not, then inspect file log.txt to figure out what went wrong.  These four files produced by a reference implementation have already been supplied to you for the three example Java programs under chord/asgn/tests/ in their respective working directories.  To prevent over-writing, they are named VH.correct.txt, FH.correct.txt, etc.  Note that the order of the tuples in the four files produced by your implementation need not be the same as that in those produced by the reference implementation.  You can sort all files (e.g. using the sort utility in Unix) and then compare them (e.g. using the diff utility in Unix).

 

To fix syntactic errors in your Datalog program, run the command:

 

ant –Dchord.work.dir=<dir> -Dchord.dlog.file=asgn/dlog/pa.dlog solve

 

This command presumes that all input program relations declared in the Datalog program exist in *.bdd files in the working directory.  It will print all syntactic errors in your Datalog program to standard output even if the input relations are not present but in this case it will assume that the input relations are all empty, thereby giving incorrect output relations (in files VH.bdd, FH.bdd, etc.)

 

To fix semantic errors in your Datalog program, run the command:

 

ant –Dchord.work.dir=<dir> -Dchord.dlog.file=asgn/dlog/pa_debug.dlog debug

 

This will start a command-line interpreter.  You can now query the contents of a relation, e.g.:

 

            FH(f,h)?

 

This will print to standard output something of the form:

 

(f=<java.lang.Float: java.lang.Class class>(174), h=java/lang/Float.java:112: $r1 = new java.lang.Class(360))

(f=<java.lang.System: java.lang.SecurityManager security>(178),h=sun/misc/Launcher.java:75: $r6 = new java.lang.SecurityManager(3431))

... [elided for brevity]

and more (3641 in total)

 

By default, only 100 arbitrary tuples in the relation will be printed.  To increase the limit, provide argument –Dchord.max.tuples=N to the above ant command, where N is the maximum number of tuples you would like to print.  Each line in the above output is a separate tuple in relation FH.  For instance, the first line states that the field named “<java.lang.Float: java.lang.Class class>” with index 174 in domain F may point to an object allocated at site “java/lang/Float.java:112: $r1 = new java.lang.Class” with index 360 in domain H.  You can relate the names of values with their indices in program domains by inspecting files *.map in the working directory.  For instance, line 361 in file H.map will be “java/lang/Float.java:112: $r1 = new java.lang.Class” (assuming the first line in the file is numbered 1 instead of 0, you need to look at line N + 1 to find the name of the value having index N).

 

Another useful interpreter command is querying whether a relation has any tuples with certain values in certain domains, e.g.:

 

FH(f,360)?

 

This will print to standard output each f such that (f,360) is a tuple in relation FH.

 

Due Date

 

April 14, 2009.

 

How to Submit?

 

You should email me (mayur.naik@gmail.com) a link to your archive, named chord.tar.gz, before 11 am on April 14, 2009.  Please omit directory chord/asgn/tests/ from the archive (especially if you email it as an attachment to me) as it consumes a lot of space.  I will run

 

ant -Dchord.work.dir=<dir> pa

 

where <dir> ranges over example Java programs similar to those provided under chord/asgn/tests/, and see if your implementation produces exactly the same points-to information and call graph (i.e. the relations in files FH.txt, VH.txt, HFH.txt, and IM.txt) as that produced by my implementation.  Please come to Koushik Sen's office on April 15, 2009 afternoon to give a demo.