Commit 961dd2a

Merge pull request #4 in PGX/pgx-samples from feature/paper-classification to master
* commit 'ec86daa8e47382432535e1b444948d6a0876d33f': Add Cora example
2 parents aa287bd + ec86daa commit 961dd2a

12 files changed: +443 -0 lines changed

README.md

+5
@@ -12,6 +12,7 @@
 6. [Article Ranking](#article-ranking)
 7. [Movie Recommendation](#movie-recommender)
 8. [Entity Linking](#entity-linking)
+9. [Research Paper Classification](#paper-classification)
 
 ****
 
@@ -59,3 +60,7 @@ More details regarding this application are available [here](movie-recommendatio
 ## Entity Linking <a name="entity-linking"></a>
 Entity Linking allows to connect Named Entities (for example, names of famous people) to their Wikipedia/DBpedia page.
 This application leverages vertex embeddings to provide high-quality results. More details available [here](entity-linking/README.md) and in our [paper](https://dl.acm.org/citation.cfm?doid=3327964.3328499).
+
+## Research Paper Classification <a name="paper-classification"></a>
+This application demonstrates how graph data can be used to enhance classification performance of a research paper classifier.
+More details regarding this application are available [here](paper-classification/README.md).

paper-classification/README.md

+40
@@ -0,0 +1,40 @@
Note: this example makes use of the new `SupervisedGraphWise` classifier, which is available in PGX 19.4.0.
# Research Paper Classification - Cora Dataset
Graph analytics often allows you to leverage relational information that other, classical models simply have no
effective way of capturing. One example of such data is the citation network of research papers.

## The Cora dataset
The Cora dataset consists of 2708 machine-learning research papers, each labeled with one of seven topics.
Each paper has a 1433-dimensional binary feature vector, where 0/1 indicates the absence/presence of selected words from
the dictionary. Additionally, the dataset provides the citation graph, where each edge represents a citation between two
research papers.

To run this example, download [the dataset](https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz) and extract the
files `cora.cites` and `cora.content` into the `data/graph/cora` directory.
`cora.content` contains the vertex (research paper) feature vectors, and `cora.cites` contains the edges.
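
To make the feature-vector layout concrete, here is a minimal sketch that parses one line of `cora.content`. It assumes each line holds whitespace-separated fields in the order paper id, 0/1 word flags, topic label; `CoraContentLine` is a hypothetical helper for illustration and is not part of the sample or of PGX.

```java
import java.util.Arrays;

/** Hypothetical helper: one parsed line of cora.content (id, binary features, label). */
final class CoraContentLine {
    final String paperId;
    final boolean[] features;
    final String label;

    CoraContentLine(String paperId, boolean[] features, String label) {
        this.paperId = paperId;
        this.features = features;
        this.label = label;
    }

    /** Assumed layout: first field is the id, last field is the label, the rest are 0/1 flags. */
    static CoraContentLine parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected an id, at least one feature, and a label");
        }
        boolean[] features = new boolean[fields.length - 2];
        for (int i = 0; i < features.length; i++) {
            String flag = fields[i + 1];
            if (!flag.equals("0") && !flag.equals("1")) {
                throw new IllegalArgumentException("feature " + i + " is not 0 or 1: " + flag);
            }
            features[i] = flag.equals("1");
        }
        return new CoraContentLine(fields[0], features, fields[fields.length - 1]);
    }

    public static void main(String[] args) {
        // Toy line with 4 features; a real line would carry 1433 of them.
        CoraContentLine paper = parse("31336\t0\t1\t0\t1\tNeural_Networks");
        System.out.println(paper.paperId + " " + Arrays.toString(paper.features) + " " + paper.label);
    }
}
```

In the real dataset, `features.length` would be 1433 for every paper, which makes a cheap sanity check after loading.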

## Running the example
The example can be run using the command `./gradlew run`. This will perform the following steps:
1. Load the Cora graph into the PGX server
2. Create a train-test split
3. Train a `SupervisedGraphWise` model on the training graph
4. Evaluate the trained model on the unseen test vertices
5. Infer and save embeddings for all vertices into `embeddings.csv`
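
The sample's own split logic is not shown in this commit view. As an illustration of step 2 only, the following sketch shuffles vertex ids with a fixed seed and cuts them into disjoint train and test sets. The class name, the test fraction, and the seed are assumptions, and no PGX API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: a reproducible random train-test split over vertex ids. */
final class TrainTestSplit {
    final List<String> train;
    final List<String> test;

    TrainTestSplit(List<String> train, List<String> test) {
        this.train = train;
        this.test = test;
    }

    /** Shuffles a copy of the ids with the given seed and holds out testFraction of them. */
    static TrainTestSplit split(List<String> ids, double testFraction, long seed) {
        if (testFraction < 0.0 || testFraction > 1.0) {
            throw new IllegalArgumentException("testFraction must be in [0, 1]");
        }
        List<String> shuffled = new ArrayList<>(ids);
        Collections.shuffle(shuffled, new Random(seed));
        int testSize = (int) Math.round(shuffled.size() * testFraction);
        List<String> test = new ArrayList<>(shuffled.subList(0, testSize));
        List<String> train = new ArrayList<>(shuffled.subList(testSize, shuffled.size()));
        return new TrainTestSplit(train, test);
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9");
        TrainTestSplit s = split(ids, 0.2, 42L);
        System.out.println("train=" + s.train.size() + " test=" + s.test.size());
    }
}
```

Fixing the seed keeps the evaluation numbers comparable between runs; the test vertices must stay out of the training graph for the evaluation in step 4 to be meaningful.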

## Evaluating the results
The model performs well on the held-out test set. The following is example evaluation output:

| Accuracy | Precision | Recall | F1-Score |
|----------|-----------|--------|----------|
| 0.869    | 0.867     | 0.846  | 0.854    |

The hyperparameters of the model have been left at their default settings, but they can be tuned to improve
performance.
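
For readers who want to reproduce metrics of this kind outside PGX, here is a minimal sketch that derives them from a multi-class confusion matrix. It assumes macro averaging (each class weighted equally, with F1 averaged per class); the README does not state which averaging the sample uses, and `ClassificationMetrics` is a hypothetical helper.

```java
/** Hypothetical helper: accuracy and macro-averaged precision, recall, and F1. */
final class ClassificationMetrics {
    final double accuracy;
    final double precision;
    final double recall;
    final double f1;

    ClassificationMetrics(double accuracy, double precision, double recall, double f1) {
        this.accuracy = accuracy;
        this.precision = precision;
        this.recall = recall;
        this.f1 = f1;
    }

    /** matrix[t][p] counts samples whose true class is t and predicted class is p. */
    static ClassificationMetrics fromConfusionMatrix(int[][] matrix) {
        int k = matrix.length;
        long total = 0;
        long correct = 0;
        double precisionSum = 0.0;
        double recallSum = 0.0;
        double f1Sum = 0.0;
        for (int c = 0; c < k; c++) {
            long truePositives = matrix[c][c];
            long actual = 0;     // row sum: samples that truly belong to class c
            long predicted = 0;  // column sum: samples predicted as class c
            for (int j = 0; j < k; j++) {
                actual += matrix[c][j];
                predicted += matrix[j][c];
            }
            total += actual;
            correct += truePositives;
            double p = predicted == 0 ? 0.0 : (double) truePositives / predicted;
            double r = actual == 0 ? 0.0 : (double) truePositives / actual;
            double f = (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
            precisionSum += p;
            recallSum += r;
            f1Sum += f;
        }
        if (k == 0 || total == 0) {
            throw new IllegalArgumentException("confusion matrix is empty");
        }
        return new ClassificationMetrics((double) correct / total,
                precisionSum / k, recallSum / k, f1Sum / k);
    }

    public static void main(String[] args) {
        // Toy two-class matrix; Cora would use a 7x7 matrix, one row per topic.
        ClassificationMetrics m = fromConfusionMatrix(new int[][] {{2, 0}, {1, 1}});
        System.out.printf("accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f%n",
                m.accuracy, m.precision, m.recall, m.f1);
    }
}
```

With per-class averaging, the reported F1 is generally not the harmonic mean of the averaged precision and recall, so the four numbers in such a table need not satisfy that identity.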

## Visualizing the Embeddings
The embeddings themselves can be used for other downstream tasks, such as clustering or link prediction. To visualize them
in an understandable manner, we can generate and plot t-SNE vectors for the vertices. The following figure is the result.

![Cora Embeddings t-SNE plot](cora_tsne.png "Cora Embeddings t-SNE plot")
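
As a rough illustration of a downstream use such as link prediction, one common baseline (not something the sample itself implements) is to score a pair of vertices by the cosine similarity of their embedding vectors. The sketch below assumes the embeddings have already been loaded into `double[]` arrays; the layout of `embeddings.csv` is not shown here, so no parsing is attempted, and `EmbeddingSimilarity` is a hypothetical helper.

```java
/** Hypothetical helper: cosine similarity between two embedding vectors. */
final class EmbeddingSimilarity {

    /** Returns a value in [-1, 1]; returns 0 if either vector has zero length. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy 3-dimensional embeddings for two papers.
        double[] paperA = {0.2, 0.9, 0.1};
        double[] paperB = {0.3, 0.8, 0.0};
        System.out.println("similarity = " + cosine(paperA, paperB));
    }
}
```

Pairs with high similarity but no existing citation edge would then be the candidate links to inspect.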

paper-classification/build.gradle

+34
@@ -0,0 +1,34 @@
apply plugin: 'java'
jar.enabled = false

ext {
    mainClassName = 'Apps.CoraApp'
}

def pgxDir = "$projectDir/../libs/pgx-19.4.0-SNAPSHOT"
def commonDeps = "$pgxDir/shared-lib/common"
def embeddedDeps = "$pgxDir/shared-lib/embedded"
def serverDeps = "$pgxDir/shared-lib/server"
def thirdPartyDeps = "$pgxDir/shared-lib/third-party"
def smCommonDeps = "$pgxDir/shared-memory/common"
def smClientDeps = "$pgxDir/shared-memory/client"
def smEmbeddedDeps = "$pgxDir/shared-memory/embedded"
def smThirdPartyDeps = "$pgxDir/shared-memory/third-party"

dependencies {
    compile fileTree(dir: commonDeps, includes: ['*.jar'])
    compile fileTree(dir: embeddedDeps, includes: ['*.jar'])
    compile fileTree(dir: serverDeps, includes: ['*.jar'])
    compile fileTree(dir: thirdPartyDeps, includes: ['*.jar'])
    compile fileTree(dir: smCommonDeps, includes: ['*.jar'])
    compile fileTree(dir: smClientDeps, includes: ['*.jar'])
    compile fileTree(dir: smEmbeddedDeps, includes: ['*.jar'])
    compile fileTree(dir: smThirdPartyDeps, includes: ['*.jar'])
}

task run(type: JavaExec, dependsOn: jar) {
    jvmArgs = ['-Dpgx_conf=conf/pgx.conf']
    classpath = sourceSets.main.runtimeClasspath
    main = mainClassName
    args = [project.projectDir]
}

paper-classification/conf/pgx.conf

+5
@@ -0,0 +1,5 @@
{
  "datasource_dir_whitelist" : ["PUT_THE_ABSOLUTE_PATH_TO_DATA_DIRECTORY_HERE"],
  "allow_local_filesystem": true,
  "tmp_dir": "tmp_data"
}
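
The whitelist placeholder must be replaced with an absolute path. The following is a minimal sketch for printing one to paste into the config. It assumes the data directory is the `data` folder under the project directory (the README extracts the dataset into `data/graph/cora`) and that the project directory is given as the first program argument, as the `run` task in `build.gradle` passes it. `DataDirResolver` is a hypothetical helper and is not part of the sample.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: prints the absolute data directory for datasource_dir_whitelist. */
final class DataDirResolver {

    /** Resolves <projectDir>/data to an absolute, normalized path (assumed directory name). */
    static Path resolve(String projectDir) {
        return Paths.get(projectDir).resolve("data").toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        // Falls back to the current directory when no project directory is given.
        String projectDir = args.length > 0 ? args[0] : ".";
        System.out.println(resolve(projectDir));
    }
}
```

Normalizing the path removes `.` and `..` segments, which avoids surprises if the whitelist check compares path strings.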

paper-classification/cora_tsne.png

53.1 KB

@@ -0,0 +1 @@
Place the files `cora.content` and `cora.cites` here. See `README.md` for more information.

Binary file not shown.

@@ -0,0 +1,5 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-5.2.1-bin.zip

paper-classification/gradlew

+172
@@ -0,0 +1,172 @@
#!/usr/bin/env sh

##############################################################################
##
## Gradle start up script for UN*X
##
##############################################################################

# Attempt to set APP_HOME
# Resolve links: $0 may be a link
PRG="$0"
# Need this for relative symlinks.
while [ -h "$PRG" ] ; do
    ls=`ls -ld "$PRG"`
    link=`expr "$ls" : '.*-> \(.*\)$'`
    if expr "$link" : '/.*' > /dev/null; then
        PRG="$link"
    else
        PRG=`dirname "$PRG"`"/$link"
    fi
done
SAVED="`pwd`"
cd "`dirname \"$PRG\"`/" >/dev/null
APP_HOME="`pwd -P`"
cd "$SAVED" >/dev/null

APP_NAME="Gradle"
APP_BASE_NAME=`basename "$0"`

# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
DEFAULT_JVM_OPTS=""

# Use the maximum available, or set MAX_FD != -1 to use that value.
MAX_FD="maximum"

warn () {
    echo "$*"
}

die () {
    echo
    echo "$*"
    echo
    exit 1
}

# OS specific support (must be 'true' or 'false').
cygwin=false
msys=false
darwin=false
nonstop=false
case "`uname`" in
  CYGWIN* )
    cygwin=true
    ;;
  Darwin* )
    darwin=true
    ;;
  MINGW* )
    msys=true
    ;;
  NONSTOP* )
    nonstop=true
    ;;
esac

CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar

# Determine the Java command to use to start the JVM.
if [ -n "$JAVA_HOME" ] ; then
    if [ -x "$JAVA_HOME/jre/sh/java" ] ; then
        # IBM's JDK on AIX uses strange locations for the executables
        JAVACMD="$JAVA_HOME/jre/sh/java"
    else
        JAVACMD="$JAVA_HOME/bin/java"
    fi
    if [ ! -x "$JAVACMD" ] ; then
        die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME

Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
    fi
else
    JAVACMD="java"
    which java >/dev/null 2>&1 || die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.

Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi

# Increase the maximum file descriptors if we can.
if [ "$cygwin" = "false" -a "$darwin" = "false" -a "$nonstop" = "false" ] ; then
    MAX_FD_LIMIT=`ulimit -H -n`
    if [ $? -eq 0 ] ; then
        if [ "$MAX_FD" = "maximum" -o "$MAX_FD" = "max" ] ; then
            MAX_FD="$MAX_FD_LIMIT"
        fi
        ulimit -n $MAX_FD
        if [ $? -ne 0 ] ; then
            warn "Could not set maximum file descriptor limit: $MAX_FD"
        fi
    else
        warn "Could not query maximum file descriptor limit: $MAX_FD_LIMIT"
    fi
fi

# For Darwin, add options to specify how the application appears in the dock
if $darwin; then
    GRADLE_OPTS="$GRADLE_OPTS \"-Xdock:name=$APP_NAME\" \"-Xdock:icon=$APP_HOME/media/gradle.icns\""
fi

# For Cygwin, switch paths to Windows format before running java
if $cygwin ; then
    APP_HOME=`cygpath --path --mixed "$APP_HOME"`
    CLASSPATH=`cygpath --path --mixed "$CLASSPATH"`
    JAVACMD=`cygpath --unix "$JAVACMD"`

    # We build the pattern for arguments to be converted via cygpath
    ROOTDIRSRAW=`find -L / -maxdepth 1 -mindepth 1 -type d 2>/dev/null`
    SEP=""
    for dir in $ROOTDIRSRAW ; do
        ROOTDIRS="$ROOTDIRS$SEP$dir"
        SEP="|"
    done
    OURCYGPATTERN="(^($ROOTDIRS))"
    # Add a user-defined pattern to the cygpath arguments
    if [ "$GRADLE_CYGPATTERN" != "" ] ; then
        OURCYGPATTERN="$OURCYGPATTERN|($GRADLE_CYGPATTERN)"
    fi
    # Now convert the arguments - kludge to limit ourselves to /bin/sh
    i=0
    for arg in "$@" ; do
        CHECK=`echo "$arg"|egrep -c "$OURCYGPATTERN" -`
        CHECK2=`echo "$arg"|egrep -c "^-"`                                 ### Determine if an option

        if [ $CHECK -ne 0 ] && [ $CHECK2 -eq 0 ] ; then                    ### Added a condition
            eval `echo args$i`=`cygpath --path --ignore --mixed "$arg"`
        else
            eval `echo args$i`="\"$arg\""
        fi
        i=$((i+1))
    done
    case $i in
        (0) set -- ;;
        (1) set -- "$args0" ;;
        (2) set -- "$args0" "$args1" ;;
        (3) set -- "$args0" "$args1" "$args2" ;;
        (4) set -- "$args0" "$args1" "$args2" "$args3" ;;
        (5) set -- "$args0" "$args1" "$args2" "$args3" "$args4" ;;
        (6) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" ;;
        (7) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" ;;
        (8) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" ;;
        (9) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" "$args8" ;;
    esac
fi

# Escape application args
save () {
    for i do printf %s\\n "$i" | sed "s/'/'\\\\''/g;1s/^/'/;\$s/\$/' \\\\/" ; done
    echo " "
}
APP_ARGS=$(save "$@")

# Collect all arguments for the java command, following the shell quoting and substitution rules
eval set -- $DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS "\"-Dorg.gradle.appname=$APP_BASE_NAME\"" -classpath "\"$CLASSPATH\"" org.gradle.wrapper.GradleWrapperMain "$APP_ARGS"

# by default we should be in the correct project dir, but when run from Finder on Mac, the cwd is wrong
if [ "$(uname)" = "Darwin" ] && [ "$HOME" = "$PWD" ]; then
  cd "$(dirname "$0")"
fi

exec "$JAVACMD" "$@"

paper-classification/gradlew.bat

+84
@@ -0,0 +1,84 @@
@if "%DEBUG%" == "" @echo off
@rem ##########################################################################
@rem
@rem Gradle startup script for Windows
@rem
@rem ##########################################################################

@rem Set local scope for the variables with windows NT shell
if "%OS%"=="Windows_NT" setlocal

set DIRNAME=%~dp0
if "%DIRNAME%" == "" set DIRNAME=.
set APP_BASE_NAME=%~n0
set APP_HOME=%DIRNAME%

@rem Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
set DEFAULT_JVM_OPTS=

@rem Find java.exe
if defined JAVA_HOME goto findJavaFromJavaHome

set JAVA_EXE=java.exe
%JAVA_EXE% -version >NUL 2>&1
if "%ERRORLEVEL%" == "0" goto init

echo.
echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
echo.
echo Please set the JAVA_HOME variable in your environment to match the
echo location of your Java installation.

goto fail

:findJavaFromJavaHome
set JAVA_HOME=%JAVA_HOME:"=%
set JAVA_EXE=%JAVA_HOME%/bin/java.exe

if exist "%JAVA_EXE%" goto init

echo.
echo ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME%
echo.
echo Please set the JAVA_HOME variable in your environment to match the
echo location of your Java installation.

goto fail

:init
@rem Get command-line arguments, handling Windows variants

if not "%OS%" == "Windows_NT" goto win9xME_args

:win9xME_args
@rem Slurp the command line arguments.
set CMD_LINE_ARGS=
set _SKIP=2

:win9xME_args_slurp
if "x%~1" == "x" goto execute

set CMD_LINE_ARGS=%*

:execute
@rem Setup the command line

set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar

@rem Execute Gradle
"%JAVA_EXE%" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% "-Dorg.gradle.appname=%APP_BASE_NAME%" -classpath "%CLASSPATH%" org.gradle.wrapper.GradleWrapperMain %CMD_LINE_ARGS%

:end
@rem End local scope for the variables with windows NT shell
if "%ERRORLEVEL%"=="0" goto mainEnd

:fail
rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of
rem the _cmd.exe /c_ return code!
if not "" == "%GRADLE_EXIT_CONSOLE%" exit 1
exit /b 1

:mainEnd
if "%OS%"=="Windows_NT" endlocal

:omega
