Can predictive coding be used to classify records?

A recent legal case in the United States, Plaintiffs v Peck, may set a precedent for the way documents are categorised for e-discovery (through ‘predictive coding’), a development that I think could ultimately affect the way records are classified.

The presiding judge (Andrew Peck, a United States magistrate judge for the Southern District of New York) wrote an article in Law Technology News in October 2011 titled ‘Search, Forward: Will manual document review and keyword searches be replaced by computer-assisted coding?’.

In his article, Peck described the problems associated with the traditional, manual way of reviewing documents. He then referred to ‘… two recent research studies that clearly demonstrate that computerised searches are at least as accurate, if not more so, than manual review’. (The details of these studies are included in the article.)

Peck discussed the use of keywords applied to electronic documents, and the poor results they often produce (‘average recall was just 20% … (a) result (that) has been replicated … over the past few years’). He noted the generally negative judicial reaction to the use of keywords in e-discovery, partly because the manual process is regarded as the ‘gold standard’.
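To make the 20% figure concrete: recall is the share of the truly relevant documents that a search actually finds, and precision is the share of the returned documents that are relevant. The following is a minimal sketch in Python; the document IDs and counts are invented for illustration and are not taken from the studies Peck cites.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a search, given sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical collection: 100 documents are truly relevant (IDs 0-99).
relevant = set(range(100))
# A hypothetical keyword search returns 50 documents, only 20 of them relevant.
retrieved = set(range(20)) | set(range(1000, 1030))

recall, precision = recall_and_precision(retrieved, relevant)
print(f"recall={recall:.0%} precision={precision:.0%}")
```

In this made-up case the search misses 80 of the 100 relevant documents, which is what a recall of 20% means in practice.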

Given the increasingly digital nature of e-discovery, Peck noted the increasing use of ‘computer-assisted coding’, more commonly known as ‘predictive coding’. This methodology is described as ‘tools that use sophisticated algorithms to enable the computer to determine relevance based on interaction with a human reviewer’, using a set of documents to ‘train’ the system.

He stated that the better results achieved by auto-classification systems (predictive coding), compared with keywords, are likely to make that approach more appealing in US courts in the near future.

The specific case referred to (Moore v. Publicis, a ‘high profile employment discrimination case’) involved 3 million electronic documents, and the need to cull them. The plaintiffs objected to the use of predictive coding and criticised ‘… the use of such a novel method of discovery without supporting evidence or procedures for assessing reliability’.

Another judge, Judge Carter, was asked to review the case and upheld the decision ‘after finding it to be well reasoned and therefore not subject to reversal.’

For background to the case, see: Plaintiffs v Peck – A Worthy Addition to your Summer Reading List. 10 July 2012. ELLBLOG.

Recommind has produced a short booklet titled ‘Predictive Coding for Dummies’, available as a free, 36-page, pdf. The Dummies Guide notes that ‘Real living, breathing legal experts are essential to predictive coding. These experts use built-in search and analytical tools — including keyword, Boolean and concept search, category grouping, and more than 40 other automatically populated filters — collectively referred to as predictive analytics — to identify documents that need to be reviewed and coded.’ Replace ‘legal experts’ with ‘records managers’ and the role of the records manager is clear.

Of course, finding all the correct documents within a classification is only one part of the requirement. The classification needs to be persistent and connected with other recordkeeping requirements including retention management.

I co-wrote an article for the November issue of IQ, the RIMPA industry quarterly, with Umi Mokhtar from Universiti Kebangsaan Malaysia (UKM), asking whether technology can classify records better than a human can.

The article noted apparent success rates when the technology was used for legal review, and questioned whether the same technology could be used to classify records (or apply classification terms to them) instead of expecting either (a) ‘containers’ to have the right classification terms given their content or (b) users to ‘file’ documents against the correct classification.

I was fortunate to have the chance to sit with one of the predictive coding vendors last week to discuss these issues and my concerns about the effectiveness of the technology to classify records correctly.

I had a similar discussion with a reseller of the same product almost 8 years ago, at the height of EDRMS implementations, when this technology was only seen for its value in finding information.

Roll forward 8 years: even more massive amounts of digital information are being captured and stored, EDRMS implementations capture only a small fraction of it, and the technology looks more appealing as a tool that can support digital recordkeeping.

What struck me most about the technology was the way it presents the results to a user. If you didn’t know it was an advanced search and categorisation engine, you might be forgiven for thinking that the screen of results was actually from an EDRMS.

  • On the left-hand side was the classification scheme, which I could browse.
  • Clicking on one of the activities took me to the subject.
  • The list of results showed all or most of the same basic metadata you would expect in your EDRMS; more if it was added when the record was saved.
  • From the results listed I could find similar documents, see similar or related search results, and add public or private tags.

The technology doesn’t just figure out the classification by itself – it has to be trained, and who better to train the system than records managers? Start with the business classification scheme, find 100 records that match a classification, ask the system to find 1000 more, then confirm the results (and manage the exceptions). And so on, until the technology has classified all the digital records you have allowed it to search (including network drives, email etc).
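The train, suggest and confirm cycle can be sketched in a few lines of Python. This is a toy illustration only: the word-overlap scorer, the function names, the sample records and the classification terms are all my own assumptions, and it does not represent how any vendor’s product works. The `confirm` function stands in for the records manager’s decision.

```python
from collections import Counter


def tokens(text):
    """Lower-case the text and keep purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def train(examples):
    """Build one word-frequency profile per classification term."""
    profiles = {}
    for text, label in examples:
        profiles.setdefault(label, Counter()).update(tokens(text))
    return profiles


def suggest(profiles, text):
    """Return (label, score) for the best-matching profile, or (None, 0.0)."""
    words = tokens(text)
    best_label, best_score = None, 0.0
    for label, profile in profiles.items():
        total = sum(profile.values())
        score = sum(profile[w] for w in words) / total if total else 0.0
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


def classify_in_rounds(seed, unclassified, confirm, rounds=3):
    """Train on the seed set, then fold confirmed suggestions back in each round."""
    examples, remaining = list(seed), list(unclassified)
    for _ in range(rounds):
        profiles = train(examples)
        still_unclassified = []
        for text in remaining:
            label, _score = suggest(profiles, text)
            if label is not None and confirm(text, label):
                examples.append((text, label))
            else:
                still_unclassified.append(text)  # exception for manual handling
        if len(still_unclassified) == len(remaining):
            break  # nothing new was confirmed in this round
        remaining = still_unclassified
    return examples, remaining


# Hypothetical seed records chosen by a records manager.
seed = [
    ("invoice payment received from supplier", "Financial Management - Accounts"),
    ("recruitment interview panel for vacancy", "Personnel - Recruitment"),
]
unclassified = [
    "supplier invoice overdue",
    "vacancy advertised and interview scheduled",
    "overdue reminder notice",
    "lunch menu",
]
# The "right answers" the records manager would confirm (invented for the demo).
truth = {
    "supplier invoice overdue": "Financial Management - Accounts",
    "vacancy advertised and interview scheduled": "Personnel - Recruitment",
    "overdue reminder notice": "Financial Management - Accounts",
}


def confirm(text, label):
    return truth.get(text) == label


examples, remaining = classify_in_rounds(seed, unclassified, confirm)
print(len(examples), remaining)
```

In this sketch ‘overdue reminder notice’ shares no words with the seed records, so it is only picked up in the second round, after ‘supplier invoice overdue’ has been confirmed and added to the training set. ‘lunch menu’ never matches anything and is left as an exception.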

The link with disposal is there too. The technology classifies all the digital records you point it at. If your classification system is linked to your retention schedules, you are presented with all the information that can then be subject to a retention rule. Keeping the metadata about the records that you destroy is then a matter of capturing the metadata found in the search and applying additional metadata about the disposal action.
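As a rough sketch of that link, assuming each classified record carries its classification term and a closing date: the retention periods, field names and sample records below are hypothetical and are not drawn from any real retention schedule or product.

```python
from datetime import date

# Hypothetical retention schedule: classification term -> years to retain.
RETENTION_YEARS = {
    "Financial Management - Accounts": 7,
    "Personnel - Recruitment": 2,
}


def add_years(d, years):
    """Return the date `years` later, mapping 29 February to 28 February if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:  # 29 February in a non-leap target year
        return d.replace(year=d.year + years, day=28)


def due_for_disposal(records, today):
    """Return the records whose retention period has expired as of `today`."""
    due = []
    for rec in records:
        years = RETENTION_YEARS.get(rec["classification"])
        if years is None:
            continue  # no retention rule: leave for manual appraisal
        if add_years(rec["date_closed"], years) <= today:
            due.append(rec)
    return due


def disposal_stub(rec, today, action="destroyed"):
    """Keep the record's metadata (not its content) and add disposal metadata."""
    stub = {k: v for k, v in rec.items() if k != "content"}
    stub.update({"disposal_action": action, "disposal_date": today})
    return stub


records = [
    {"id": "R1", "title": "Supplier invoice",
     "classification": "Financial Management - Accounts",
     "date_closed": date(2004, 6, 30), "content": "invoice text"},
    {"id": "R2", "title": "Interview notes",
     "classification": "Personnel - Recruitment",
     "date_closed": date(2012, 3, 1), "content": "interview text"},
]

today = date(2012, 11, 1)
stubs = [disposal_stub(r, today) for r in due_for_disposal(records, today)]
print([s["id"] for s in stubs])
```

Only R1 has passed its (assumed) seven-year retention period, so it is the only record for which a metadata stub is produced; R2 stays where it is.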

Options for using technology like this might include:

  • Applying it against legacy digital stores, to clean up old digital records.
  • Applying it in conjunction with an EDRMS (or SharePoint), so that the technology assigns the correct classification to a document regardless of where the user puts it (or files the document according to that classification).
  • Applying it against legacy and active digital stores, instead of using an EDRMS.

Criticism of predictive coding

The online legal site published an article on 31 October 2012, ‘Pitting Computers Against Humans in Document Review’. The article:

  • Examined whether a frequently cited (‘TREC’) study (quoted in the article by Peck above) ‘is sufficient to support the conclusions’ that technology is better than humans.
  • Noted several weaknesses in the original TREC study, pointing out that some of the ‘technology-assisted’ teams actually performed miserably.
  • Noted an ‘inherent flaw’: that the TREC study was not a fair comparison of manual reviewers to the technology-assisted teams.

However, the article also concluded that ‘one should not jump to discredit the usefulness of computer assisted methods’; technological solutions depend on the expertise of the people who use them.

Craig Ball offered an interesting reply to this article, leading with the comment: ‘Whoever challenges our assumptions and forces us to defend them is performing a valuable service, no matter what their motives’.

Ball noted the ‘sad fact … that human reviewers perform poorly in a consistent fashion, and we needn’t rely upon TREC or the Grossman/Cormack article alone to prove same’. He added ‘the fact is that the errors human reviewers make are rife, even when well trained and -motivated (if we are candid, an all-too-rare and exceptional circumstance)’. And the ‘errors human reviewers make are overwhelmingly not close calls (but) plainly, manifestly errors of the sort that mindless, heartless computers do not make’.

I think it’s possible to replace ‘human reviewers’ in this context with ‘end users’ who don’t understand classification terms (or, generally, recordkeeping).

My own view is that predictive coding technology has the ability to support digital recordkeeping, with the active involvement of records managers, with or without an EDRMS. It has the ability (once trained) to aggregate records by business classification scheme (BCS) terms and to apply retention rules to those records. (Incidentally, the same concept is used to put records on legal holds to prevent their disposal.)

