java - Getting original text after using Stanford NLP parser
Hello people of the internet,
We're having the following problem with the Stanford NLP API: we have a String that we want to transform into a list of sentences. First, we used String sentenceString = Sentence.listToString(sentence); but listToString does not return the original text because of the tokenization. Then we tried to use listToOriginalTextString in the following way:
private static List<String> getSentences(String text) {
    Reader reader = new StringReader(text);
    DocumentPreprocessor dp = new DocumentPreprocessor(reader);
    List<String> sentenceList = new ArrayList<String>();
    for (List<HasWord> sentence : dp) {
        String sentenceString = Sentence.listToOriginalTextString(sentence);
        sentenceList.add(sentenceString);
    }
    return sentenceList;
}
This does not work. Apparently we have to set an attribute "invertible" to true, but we don't know how to. How can we do this?
In general, how do you use listToOriginalTextString properly? What preparations does it need?
Sincerely, Khayet
If I understand correctly, you want a mapping from the tokens back to the original input text after tokenization. You can do it like this:
import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordToSentenceProcessor;

// Split via PTBTokenizer (PTBLexer)
List<CoreLabel> tokens = PTBTokenizer.coreLabelFactory().getTokenizer(new StringReader(text)).tokenize();
// Do the processing using the Stanford sentence splitter (WordToSentenceProcessor)
WordToSentenceProcessor<CoreLabel> processor = new WordToSentenceProcessor<CoreLabel>();
List<List<CoreLabel>> splitSentences = processor.process(tokens);
// For each sentence
for (List<CoreLabel> s : splitSentences) {
    // For each word
    for (CoreLabel token : s) {
        // Here you can get the token value and its position in the input:
        // token.value(), token.beginPosition(), token.endPosition()
    }
}
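Once every token carries its begin and end offsets, the original text of a sentence is just the slice of the input from the first token's begin offset to the last token's end offset. Here is a minimal sketch of that idea using only the JDK. The Token class is a hypothetical stand-in for CoreLabel (its begin and end fields play the role of beginPosition() and endPosition()), and the offsets in demo() are written by hand to imitate PTB-style tokenization rather than produced by the real tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OriginalTextDemo {

    // Hypothetical stand-in for CoreLabel: only the character offsets matter here.
    static final class Token {
        final int begin; // offset of the first character (inclusive)
        final int end;   // offset just past the last character (exclusive)

        Token(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Returns, for each non-empty sentence, the exact slice of the input it covers.
    static List<String> originalSentences(String text, List<List<Token>> sentences) {
        List<String> result = new ArrayList<>();
        for (List<Token> sentence : sentences) {
            if (sentence.isEmpty()) {
                continue; // nothing to slice
            }
            int start = sentence.get(0).begin;
            int stop = sentence.get(sentence.size() - 1).end;
            result.add(text.substring(start, stop));
        }
        return result;
    }

    // Hand-written offsets (an assumption) for the tokens
    // Do | n't | panic | .   and   It | 's | fine | .
    static List<String> demo() {
        String text = "Don't  panic. It's fine.";
        List<Token> first = Arrays.asList(
                new Token(0, 2), new Token(2, 5), new Token(7, 12), new Token(12, 13));
        List<Token> second = Arrays.asList(
                new Token(14, 16), new Token(16, 18), new Token(19, 23), new Token(23, 24));
        return originalSentences(text, Arrays.asList(first, second));
    }

    public static void main(String[] args) {
        for (String sentence : demo()) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Slicing the input keeps whitespace and contractions exactly as they were typed (note the double space in the first sentence), which joining the token values with spaces would not do.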