Comments on: “Sum” THATCamp possibilities?
http://london2010.thatcamp.org/2010/06/17/sum-thatcamp-possibilities/

By: aelang, Fri, 18 Jun 2010 09:54:42 +0000
http://london2010.thatcamp.org/2010/06/17/sum-thatcamp-possibilities/#comment-24

Hello Eric

On your ‘reducing ambiguity’ point: this can be addressed fairly easily. Run your corpus through a part-of-speech (POS) tagger, which automatically labels every word with the part of speech it belongs to (or that the computer thinks it belongs to). Tag sets differ, but the one used by the British National Corpus marks ‘being’ as NN1 when it is a noun and as VBG when it is the present participle of the verb ‘be’.
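To give a rough idea of what the tags buy you, here is a minimal Python sketch that counts ‘being’ separately as noun and as verb once the text is tagged. The sample sentence is hand-tagged by me for illustration, not real tagger output, and the `word_TAG` layout and the `count_tags` helper are assumptions of mine; check what your tagger actually produces before relying on this.

```python
from collections import Counter

# Hand-written sample in an assumed "word_TAG" layout; NOT real tagger output.
# NN1 and VBG are the BNC-style tags mentioned above; the rest are illustrative.
tagged_text = (
    "a_AT0 human_AJ0 being_NN1 is_VBZ being_VBG careful_AJ0 "
    "about_PRP every_AT0 being_NN1"
)


def count_tags(tagged, target):
    """Count how often `target` occurs under each POS tag (hypothetical helper)."""
    counts = Counter()
    for token in tagged.split():
        # Split on the last underscore so words containing "_" keep their text.
        word, _, tag = token.rpartition("_")
        if word.lower() == target:
            counts[tag] += 1
    return counts


being_counts = count_tags(tagged_text, "being")
print(being_counts)
```

With the tags in place, the noun and verb uses are just two different keys in the counter, so you can search or count either one on its own.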

The British National Corpus is tagged using CLAWS (ucrel.lancs.ac.uk/claws/), which apparently has 96-97% accuracy. There are also some open-source POS taggers out there, but I have never used any of them, so I’m afraid I can’t recommend specific ones.

Anouk
