Pack logicmoo_nlu -- ext/stanford/README.txt
Copyright (c) 2002-2017 The Board of Trustees of The Leland Stanford Junior University. All Rights Reserved.
Original core parser code by Dan Klein. Support code, additional modules, languages, features, internationalization, compaction, typed dependencies, etc. by Christopher Manning, Roger Levy, Teg Grenager, Galen Andrew, Marie-Catherine de Marneffe, Jenny Finkel, Spence Green, Bill MacCartney, Anna Rafferty, Huihsin Tseng, Pi-Chuan Chang, Wolfgang Maier, Richard Eckart, Richard Socher, John Bauer, Sebastian Schuster, and Jon Gauthier.
This release was prepared by Jason Bolton.
This package contains 6 parsers: a high-accuracy unlexicalized PCFG; a lexicalized dependency parser; a factored model, where the estimates of dependencies and an unlexicalized PCFG are jointly optimized to give a lexicalized PCFG treebank parser; a TreeRNN parser, where recursive neural networks trained with semantic word vectors are used to score parse trees; a Shift-Reduce Constituency Parser; and a transition-based neural dependency parser. Also included are grammars for various languages for use with these parsers.
For more information about the parser API, point a web browser at the included javadoc directory (use the browser's Open File command to open the index.html file inside the javadoc folder). Start with the Package page for the edu.stanford.nlp.parser.lexparser package, and then read the documentation for the LexicalizedParser class therein, particularly the documentation of its main method.
You should also look at the Parser FAQ on the web:
https://nlp.stanford.edu/software/parser-faq.html
This software requires Java 8 (JDK 1.8.0+). (You must have installed it separately. Check that the command "java -version" works and gives 1.8+.)
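The version check described above can also be scripted. The following is a minimal sketch, not part of the distribution: the function names java_major_version and check_java are made up for illustration, and the parsing assumes that "java -version" prints a line containing version "x.y.z", which is the usual but not guaranteed format.

```shell
# Extract the major Java version from a version string such as
# "1.8.0_131" (legacy numbering) or "11.0.2" (numbering used from Java 9 on).
java_major_version() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;                      # "1.8.0_131" gives 8
    *)   echo "$1" | cut -d. -f1 | sed 's/[^0-9].*//' ;; # "11.0.2" gives 11
  esac
}

# Succeed only if a java binary is on the PATH and reports version 8 or later.
# Assumption: the first line containing  version "..."  holds the version.
check_java() {
  v=$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)
  [ -n "$v" ] && [ "$(java_major_version "$v")" -ge 8 ]
}
```

After sourcing these definitions, a call such as "check_java || echo 'Java 8 or later is required'" could be placed at the top of a wrapper script.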
QUICKSTART
UNIX COMMAND-LINE USAGE
On a Unix system you should be able to parse the English test file with the following command:
./lexparser.sh data/testsent.txt
This uses the PCFG parser, which is quick to load and run, and quite accurate.
[Notes: it takes a few seconds to load the parser data before parsing begins; continued parsing is quicker. To use the lexicalized parser, replace englishPCFG.ser.gz with englishFactored.ser.gz in the lexparser.sh script and use the flag -mx600m to give more memory to java.]
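The edit described in the note can be made on a copy of the script rather than by hand. This is only a sketch: make_factored_script is a hypothetical helper name, and it assumes that lexparser.sh already passes a heap flag of the form -mxNNNm (or -mxNg) to java. If your copy of the script has no such flag, add -mx600m manually instead.

```shell
# Write a copy of a lexparser script ($1) to a new file ($2) that loads the
# factored model and gives java a 600 MB heap.
# Assumption: the input script names englishPCFG.ser.gz and contains an
# existing -mx<size> flag; both substitutions are no-ops otherwise.
make_factored_script() {
  sed -e 's/englishPCFG\.ser\.gz/englishFactored.ser.gz/' \
      -e 's/-mx[0-9]*[mg]/-mx600m/' "$1" > "$2"
  chmod +x "$2"
}

# Example, run from the parser directory (output name is arbitrary):
#   make_factored_script lexparser.sh lexparser-factored.sh
#   ./lexparser-factored.sh data/testsent.txt
```

Working on a copy keeps the original PCFG script available for quick runs.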
WINDOWS GUI USAGE
On a Windows system, assuming that java is on your PATH, you should be able to run a parsing GUI by double-clicking on the lexparser-gui.bat icon, or giving the command lexparser-gui in this directory from a command prompt.
Click Load File, Browse, and navigate to and select testsent.txt in the top directory of the parser distribution. Click Load Parser, Browse, and select the models jar, also in the top directory of the parser distribution. From the models jar, select englishPCFG.ser.gz. Click Parse to parse the first sentence.
NEURAL NETWORK DEPENDENCY PARSER USAGE
To use the neural net dependency parser, issue the following command:
java -Xmx2g -cp "*" edu.stanford.nlp.parser.nndep.DependencyParser \
    -model edu/stanford/nlp/models/parser/nndep/english_UD.gz \
    -textFile data/english-onesent.txt -outFile data/english-onesent.txt.out
The output will be written to data/english-onesent.txt.out
If you want to run on a language other than English, you will need to use a language specific POS tagger. Here is an example for Chinese:
java -Xmx2g -cp "*" edu.stanford.nlp.parser.nndep.DependencyParser \
    -model edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz \
    -tagger.model edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger \
    -textFile data/chinese-onesent-utf8.txt -outFile data/chinese-onesent-utf8.txt.out
OTHER USE CASES
The GUI is also available under Unix:
lexparser-gui.sh
Under Mac OS X, you can double-click on lexparser-gui.command to invoke the GUI. The command-line version works on all platforms. Use lexparser.bat to run it under Windows. The GUI is only for exploring the parser. It does not allow you to save output. You need to use the command-line program or programmatic API to do serious work with the parser.
ADDITIONAL GRAMMARS
The parser is supplied with several trained grammars. There are English grammars based on the standard LDC Penn Treebank WSJ training sections 2-21 (wsj*), and ones based on an augmented data set, better for questions, commands, and recent English and biomedical text (english*).
All grammars are located in the included models jar. (If you'd like to have grammar files like in older versions of the parser, you can get them by extracting them from the jar file with the 'jar -xf' command.)
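As a sketch of the extraction step, the two helpers below wrap the JDK jar tool: "jar -tf" lists the entries of an archive and "jar -xf" extracts the named entry. The function names are made up for illustration, and the jar file name and entry path shown in the example comments are assumptions. Check the listing from your own models jar before extracting.

```shell
# List the serialized grammars (*.ser.gz entries) stored in a models jar ($1).
list_grammars() {
  jar -tf "$1" | grep '\.ser\.gz$'
}

# Extract one entry ($2) from a models jar ($1) into the current directory,
# recreating the entry's directory structure.
extract_grammar() {
  jar -xf "$1" "$2"
}

# Example (jar name and entry path are assumptions, adjust to your release):
#   list_grammars stanford-parser-3.8.0-models.jar
#   extract_grammar stanford-parser-3.8.0-models.jar \
#       edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz
```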
MULTILINGUAL PARSING
In addition to the English grammars, the parser comes with trained grammars for Arabic, Chinese, French, and German. To parse with these grammars, run
lexparser-lang.sh
with no arguments to see usage instructions. You can change language-specific settings passed to the parser by modifying lexparser_lang.def.
You can also train and evaluate new grammars using:
lexparser-lang-train-test.sh
To see how we trained the grammars supplied in this distribution, see
bin/makeSerialized.csh
You will not be able to run this script (since it uses Stanford-specific file paths), but you should be able to see what we did.
Arabic
Trained on parts 1-3 of the Penn Arabic Treebank (ATB) using the pre-processing described in (Green and Manning, 2010). The default input encoding is UTF-8 Arabic script. You can convert text in Buckwalter encoding to UTF-8 with the class edu.stanford.nlp.international.arabic.Buckwalter, which is included in stanford-parser.jar.
The parser requires segmentation and tokenization of raw text per the ATB standard prior to parsing. You can generate this segmentation and tokenization with the Stanford Word Segmenter, which is available separately at:
https://nlp.stanford.edu/software/segmenter.html
Chinese
There are Chinese grammars trained just on mainland material from Xinhua and more mixed material from the LDC Chinese Treebank. The default input encoding is GB18030.
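If your Chinese text is stored as UTF-8, one way to match the default input encoding is to convert the file before parsing. The sketch below uses the standard iconv utility. It is not part of the distribution, the function name utf8_to_gb18030 is made up for illustration, and it assumes that your iconv build supports the GB18030 charset.

```shell
# Convert a UTF-8 text file ($1) into a GB18030-encoded file ($2).
# iconv exits with a non-zero status if the input is not valid UTF-8.
utf8_to_gb18030() {
  iconv -f UTF-8 -t GB18030 "$1" > "$2"
}

# Example (file names are placeholders):
#   utf8_to_gb18030 input-utf8.txt input-gb18030.txt
```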
French
Trained on the functionally annotated section of the French Treebank (FTB) using the pre-processing described in (Green et al., 2011). For raw text input, a tokenizer is enabled by default that produces FTB tokenization. To disable this tokenizer, use the "-tokenized" option. To tokenize raw text separately, see the usage information in edu.stanford.nlp.international.french.process.FrenchTokenizer.
German
Trained on the Negra corpus. Details are included in (Rafferty and Manning, 2008).
TREEBANK PREPROCESSING
The pre-processed versions of the ATB described in (Green and Manning, 2010) and the FTB described in (Green et al., 2011) can be reproduced using the TreebankPreprocessor included in this release. The configuration files are located in /conf. For example, to create the ATB data, run:
bin/run-tb-preproc -v conf/atb-latest.conf
Note that you'll need to update the conf file paths to your local treebank distributions as the data is not distributed with the parser. You'll also need to set the classpath in the cmd_line variable of run-tb-preproc.
The TreebankPreprocessor conf files support various options, which are documented in
edu.stanford.nlp.international.process.ConfigParser
EVALUATION METRICS
The Stanford parser comes with Java implementations of the following evaluation metrics:
Dependency Labeled Attachment
Evalb (Collins, 1997)
  -Includes per-category evaluation with the -c option
Leaf Ancestor (Sampson and Babarczy, 2003)
  -Both micro- and macro-averaged score
Tagging Accuracy
See the usage instructions and javadocs in the relevant classes located in edu.stanford.nlp.parser.metrics.
UNIVERSAL DEPENDENCIES vs. STANFORD DEPENDENCIES
Since v3.5.2 the default dependency representation is the new Universal Dependencies representation. Universal Dependencies were developed with the goal of being a cross-linguistically valid representation. Note that some constructs such as prepositional phrases are now analyzed differently and that the set of relations was updated. Please look at the Universal Dependencies documentation for more information:
http://www.universaldependencies.org
The parser also still supports the original Stanford Dependencies representation as described in the StanfordDependenciesManual.pdf. Use the flag
-originalDependencies
to obtain original Stanford Dependencies.
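As a sketch of where the flag goes, the wrapper below places -originalDependencies among the options that precede the grammar and the input files. Several details are assumptions rather than facts taken from this README: the grammar path is modeled on the other model paths in the jar, the -outputFormat "typedDependencies" option and the 1 GB heap size are illustrative choices, and the function name and the JAVA override variable are made up for this example.

```shell
# Parse the given files and print original Stanford Dependencies.
# JAVA may be set to choose a specific java binary; it defaults to "java".
# Assumptions: run from the parser directory so that -cp "*" picks up the
# parser and models jars, and the grammar path below matches your models jar.
parse_original_deps() {
  "${JAVA:-java}" -Xmx1g -cp "*" \
    edu.stanford.nlp.parser.lexparser.LexicalizedParser \
    -outputFormat "typedDependencies" -originalDependencies \
    edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz "$@"
}

# Example:
#   parse_original_deps data/testsent.txt
```

Omitting the flag from the same command line gives the default Universal Dependencies output, which makes it easy to compare the two representations on the same input.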
LICENSE
// StanfordLexicalizedParser -- a probabilistic lexicalized NL CFG parser
// Copyright (c) 2002-2017 The Board of Trustees of
// The Leland Stanford Junior University. All Rights Reserved.
//
// This program is free software; you can redistribute it and/or
// modify it under the terms of the GNU General Public License
// as published by the Free Software Foundation; either version 2
// of the License, or (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
//
// For more information, bug reports, fixes, contact:
// Christopher Manning
// Dept of Computer Science, Gates 2A
// Stanford CA 94305-9020
// USA
// parser-support@lists.stanford.edu
// https://nlp.stanford.edu/downloads/lex-parser.html
CHANGES
2017-06-09 3.8.0 Updated for compatibility
2016-10-31 3.7.0 new UD models
2015-12-09 3.6.0 Updated for compatibility
2015-04-20 3.5.2 Switch to universal dependencies
2014-08-27 3.4.1 Add Spanish models
2014-06-16 3.4 Shift-reduce parser
2014-01-04 3.3.1 Bugfix release, dependency improvements
2012-07-09 2.0.3 Minor bug fixes
2012-05-22 2.0.2 Supports adding extra data in non-tree format
2012-03-09 2.0.1 Caseless English model added, ready for maven
2012-01-11 2.0.0 Threadsafe!
2011-05-15 1.6.7 Minor bug fixes
2011-04-17 1.6.6 Compatible with tagger, corenlp and tregex.
2002-12-05 1.0 Initial release