Directory-Summarizer: A Tool for Summarizing Conference Proceedings and Other Document Collections

A couple of weekends ago I put together a Python script that uses sumy and pdfminer to summarize all PDF, .docx (Word), and .txt files in a user-specified directory, including its sub-directories. It also lists (but doesn’t summarize) any PowerPoint files (.ppt and .pptx). I had recently returned from the Intelligent Transportation Society of America’s Annual Meeting with a USB drive containing the conference proceedings. The problem is that the files are organized into folders by session code only (e.g., TS-3), and each session can cover quite a diverse range of papers. I wanted a way to quickly scan the proceedings for items that might be worth my while to read, and I figured the tool might serve a similar purpose for others.
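To make the directory scan concrete, here is a minimal sketch of how the files could be sorted into "summarize" and "list only" groups. It is not the actual Directory-Summarizer code; the function name `classify_files` and the two extension sets are my own illustrative choices, based only on the file types mentioned above.

```python
import os

# File types the post says are summarized vs. only listed (illustrative constants).
SUMMARIZE_EXTS = {".pdf", ".docx", ".txt"}
LIST_ONLY_EXTS = {".ppt", ".pptx"}


def classify_files(root_dir):
    """Walk root_dir and its sub-directories, sorting file paths into two lists.

    Hypothetical helper for illustration, not the tool's real function.
    """
    to_summarize, to_list = [], []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            # Compare extensions case-insensitively so "PAPER.PDF" is caught too.
            ext = os.path.splitext(name)[1].lower()
            path = os.path.join(dirpath, name)
            if ext in SUMMARIZE_EXTS:
                to_summarize.append(path)
            elif ext in LIST_ONLY_EXTS:
                to_list.append(path)
    return to_summarize, to_list
```

Calling `classify_files` on the root of the USB drive would return every summarizable document and every slide deck, regardless of which session folder it sits in.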

The user may also specify how many sentences to include in the summary of each document, as well as which of the summarization algorithms included in sumy to use.
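As a rough illustration of those two options, the sketch below exposes them as command-line arguments. The flag names (`--sentences`, `--algorithm`), the default of 6 sentences, and the default algorithm are all assumptions of mine, not the tool's real interface. The dictionary lists only a subset of the summarizer classes that ship with sumy.

```python
import argparse

# A subset of sumy's summarizer classes, keyed by a hypothetical option name.
ALGORITHMS = {
    "lsa": "sumy.summarizers.lsa.LsaSummarizer",
    "lex-rank": "sumy.summarizers.lex_rank.LexRankSummarizer",
    "text-rank": "sumy.summarizers.text_rank.TextRankSummarizer",
    "luhn": "sumy.summarizers.luhn.LuhnSummarizer",
}


def build_parser():
    """Build a parser for the options described above (flag names are assumed)."""
    parser = argparse.ArgumentParser(
        description="Summarize the documents in a directory tree."
    )
    parser.add_argument("directory", help="root directory to scan")
    parser.add_argument(
        "--sentences", type=int, default=6,
        help="number of sentences per summary (default is an assumption)",
    )
    parser.add_argument(
        "--algorithm", choices=sorted(ALGORITHMS), default="lsa",
        help="which sumy summarizer to use",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["proceedings", "--sentences", "4", "--algorithm", "luhn"])
print(args.directory, args.sentences, ALGORITHMS[args.algorithm])
```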

Summarizers generally attempt to determine which sentences in a document are most important for describing its content, and present those. They do not really understand a report, and can’t write a new abstract the way a human could. So the sentences in the summary do not flow together, but they typically do capture the content of the document. In addition to the summary, I pull out the first line of each report, as this is often the title, or the first part of the title, of the report.
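To show the general idea of picking "important" sentences, here is a toy extractive summarizer that scores each sentence by how frequent its words are in the whole document. This is a deliberately simplified stand-in, not one of sumy's algorithms and not what the tool runs. The names `first_line`, `summarize`, and the tiny stopword list are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real summarizers use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that",
             "for", "on", "with", "as", "are", "was", "by", "this"}


def first_line(text):
    """Return the first non-blank line, which is often the title."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def _words(sentence):
    """Lowercased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, sentence_count):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in _words(s))

    def score(sentence):
        ws = _words(sentence)
        # Average word frequency, so long sentences do not win on length alone.
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])
    return [sentences[i] for i in keep]
```

Because the sentences are selected rather than written, the output reads as a list of excerpts, which is why the summaries capture the topic without flowing like an abstract.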

Here’s an example 6-sentence summary the tool produced for one of the papers in the proceedings, related to semi-automated platooning of trucks to reduce fuel consumption. I think it captures the scope of the paper:


This paper provides selected final results from Phase One, which is explored a range of technical and non-technical challenges, including assessing feasible real-world business models within the trucking industry.

Testing in past FHWA EAR research and by project partner Peloton has shown that, due to aerodynamic drafting effects, DATP has the potential to significantly reduce fuel use.

The premise of this research is that taking this technology to full commercialization requires a simpler technical approach (compared to fully automated platooning) which bridges from current trucking operations to DATP.

Data was taken in order to compare the relative distance measurements provided by Dynamic Based Real Time Kinematic (DRTK) and a Delphi automotive RADAR.

This particular road segment was chosen for the initial analysis due to its relatively low traffic volumes (resulting in a data set of manageable size) and limited ingress/egress points (allowing the consideration of trucks that remained on the corridor for an extended distance).

ATA Trucking Trends 2013) indicate that over-the-road operations, with an emphasis on truckload (TL) and line-haul less-than-truckload (LTL) sectors would experience the highest likelihood of encountering the desired DATP attributes.

File Path: E:TS01\2_14620_abstract_2183_0.pdf

The Directory-Summarizer can be used to generate summaries for any collection of documents stored under a master directory, and the code is available on GitHub.

P.S.: I understand that there is a Python port of tika that, once the bugs are worked out, could be dropped in so the summarizer could handle even more file types. Alternatively, the code could be modified to use a tika service instance to do the same. If anyone does that, let me know how it goes.


Quick tip: Excel’s “compare” function

Whatever our projects, we’re very likely to have data in an Excel spreadsheet at some point. When debugging from log data, it can be very useful to compare two log files to find the differences. I just learned that Excel 2013 Professional has a built-in capability to do that. You need the “Inquire” tab, and then select “compare” from it. It’s an add-on, so search the help for “inquire” to see how to add it to the ribbon.