diff --git a/read/extraction-ground-truth/1601.03642.txt b/read/extraction-ground-truth/1601.03642.txt index dfca79e..6e2c10b 100644 --- a/read/extraction-ground-truth/1601.03642.txt +++ b/read/extraction-ground-truth/1601.03642.txt @@ -279,8 +279,8 @@ character by character. If the model is good, the text can have the correct punctuation. This would not be possible with a word predictor. -Character predictors can be implemented with RNNs. In con- -trast to standard feed-forward neural networks like multilayer +Character predictors can be implemented with RNNs. In contrast +to standard feed-forward neural networks like multilayer Perceptrons (MLPs) which was shown in Figure 1(b), those networks are trained to take their output at some point as well as the normal input. This means they can keep some information @@ -398,8 +398,8 @@ The new feature of Emily Howell compared to Emmy is that Emily Howell does not necessarily remain in a single, already known style. -Emily Howell makes use of association network. Cope empha- -sizes that this is not a form of a neural network. However, it +Emily Howell makes use of association network. Cope emphasizes +that this is not a form of a neural network. However, it is not clear from [Cop13] how exactly an association network is trained. Cope mentions that Emily Howell is explained in detail in [Cop05]. diff --git a/read/extraction-ground-truth/1602.06541.txt b/read/extraction-ground-truth/1602.06541.txt index d99f9dc..e24b55f 100644 --- a/read/extraction-ground-truth/1602.06541.txt +++ b/read/extraction-ground-truth/1602.06541.txt @@ -5,9 +5,9 @@ info@martin-thoma.de Abstract—This survey gives an overview over different techniques used for pixel-level semantic segmentation. -Metrics and datasets for the evaluation of segmenta- -tion algorithms and traditional approaches for segmen- -tation such as unsupervised methods, Decision Forests +Metrics and datasets for the evaluation of segmentation +algorithms and traditional approaches for segmentation +such as unsupervised methods, Decision Forests and SVMs are described and pointers to the relevant papers are given. Recently published approaches with convolutional neural networks are mentioned and typical @@ -19,22 +19,22 @@ I. INTRODUCTION Semantic segmentation is the task of clustering parts of images together which belong to the same -object class. This type of algorithm has several use- -cases such as detecting road signs [MBLAGJ+07], -detecting tumors [MBVLG02], detecting medical in- -struments in operations [WAH97], colon crypts segmen- -tation [CRSS14], land use and land cover classifica- -tion [HDT02]. In contrast, non-semantic segmentation -only clusters pixels together based on general character- -istics of single objects. Hence the task of non-semantic +object class. This type of algorithm has several use-cases +such as detecting road signs [MBLAGJ+07], +detecting tumors [MBVLG02], detecting medical instruments +in operations [WAH97], colon crypts segmentation +[CRSS14], land use and land cover classification +[HDT02]. In contrast, non-semantic segmentation +only clusters pixels together based on general characteristics +of single objects. Hence the task of non-semantic segmentation is not well-defined, as many different segmentations might be acceptable. Several applications of segmentation in medicine are listed in [PXP00]. 
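To make the task definition above concrete, a semantic segmentation is typically represented as a dense map of class IDs. A minimal sketch (not from either paper; numpy and the class names are assumptions):

```python
import numpy as np

# A semantic segmentation of an H x W RGB image assigns every pixel
# exactly one of k object classes, giving an H x W map of class IDs.
CLASSES = ["void", "road", "road sign", "tumor"]          # hypothetical class set
image = np.zeros((480, 640, 3), dtype=np.uint8)           # raw pixel input
segmentation = np.zeros(image.shape[:2], dtype=np.int64)  # one class ID per pixel

# A non-semantic segmentation, in contrast, would only hold region IDs
# (0, 1, 2, ...) without attaching an object class to any region.
```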
-Object detection, in comparison to semantic seg- -mentation, has to distinguish different instances of the +Object detection, in comparison to semantic segmentation, +has to distinguish different instances of the same object. While having a semantic segmentation is certainly a big advantage when trying to get object instances, there are a couple of problems: neighboring @@ -81,8 +81,8 @@ such, the classes on which the algorithm is trained is a central design decision. Most algorithms work with a fixed set of classes; -some even only work on binary classes like fore- -ground vs background [RM07], [CS10] or street vs +some even only work on binary classes like foreground +vs background [RM07], [CS10] or street vs no street [BKTT15]. However, there are also unsupervised segmentation @@ -105,8 +105,8 @@ is the glass and behind it the table, even if we only had a single image and were not allowed to move. This means we simultaneously two labels to the coordinates of the glass: Glass and table. Although there is much more -work being done on single class affiliation segmenta- -tion algorithms, there is a publication about multiple +work being done on single class affiliation segmentation +algorithms, there is a publication about multiple class affiliation segmentation [LRAL08]. Similarly, recent publications in pixel-level object segmentation used layered models [YHRF12]. @@ -121,18 +121,18 @@ inference of a segmentation varies by application. • Grayscale vs colored: Grayscale images are commonly used in medical imaging such as -magnetic resonance (MR) imaging or ultrasonog- -raphy whereas colored photographs are obviously +magnetic resonance (MR) imaging or ultrasonography +whereas colored photographs are obviously widespread. • Excluding or including depth data: RGB-D, -sometimes also called range [HJBJ+96] is avail- -able in robotics, autonomous cars and recently +sometimes also called range [HJBJ+96] is available +in robotics, autonomous cars and recently also in consumer electronics such as Microsoft Kinect [Zha12]. -• Single image vs stereo images vs co- -segmentation: Single image segmentation is the +• Single image vs stereo images vs co-segmentation: +Single image segmentation is the most wide-spread kind of segmentation, but using stereo images was already tried in [BVZ01]. It can be seen as a more natural way of segmentation as @@ -149,8 +149,8 @@ of information to find a meaningful segmentation. This idea can be extended to time series such as videos. -• 2D vs 3D: Segmenting images is a 2D segmenta- -tion task where the smallest unit is called a pixel. +• 2D vs 3D: Segmenting images is a 2D segmentation +task where the smallest unit is called a pixel. In 3D data, such as volumetric X-ray CT images as they were used in [HHR01], the smallest unit is called a voxel. @@ -169,11 +169,10 @@ the algorithm finds a fine-grained segmentation. [BJ00], [RKB04], [PS07] describe systems which work in an interactive mode. -(a) Example Scene (b) Visualization of a found seg- -mentation +(a) Example Scene (b) Visualization of a found segmentation -Figure 1: An example of a scene and a possible visu- -alization of a found segmentation. +Figure 1: An example of a scene and a possible visualization +of a found segmentation. III. EVALUATION AND DATASETS @@ -187,8 +186,8 @@ there are other measures of quality which matter when segmentation algorithms are compared. This section gives an overview of those quality measures. 
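All pixel-wise measures defined in the next subsection (per-pixel rate, mean accuracy, mean intersection over union and frequency weighted intersection over union) can be computed from a k × k confusion matrix, using the notation of [LSD14]: n_ij is the number of pixels of class i predicted as class j and t_i = ∑_j n_ij. A minimal numpy sketch, not taken from any of the cited papers:

```python
import numpy as np

def segmentation_metrics(conf):
    """Accuracy measures of [LSD14] from a k x k confusion matrix.

    conf[i, j] = n_ij = number of pixels of class i predicted as class j.
    Classes which do not occur in the ground truth (t_i = 0) would cause
    a division by zero and are assumed to be absent from conf.
    """
    conf = np.asarray(conf, dtype=float)
    n_ii = np.diag(conf)           # correctly classified pixels per class
    t = conf.sum(axis=1)           # t_i: pixels of class i in the ground truth
    pred = conf.sum(axis=0)        # sum_j n_ji: pixels predicted as class i
    iu = n_ii / (t - n_ii + pred)  # per-class intersection over union
    return {
        "per-pixel rate": n_ii.sum() / t.sum(),
        "mean accuracy": np.mean(n_ii / t),
        "mean IU": np.mean(iu),
        "frequency weighted IU": (t * iu).sum() / t.sum(),
    }
```

Unlike the raw per-pixel rate, the last three measures weight every class equally, which addresses the large-dominant-region problem discussed below.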
-1) Accuracy: Showing the correctness of the segmen-
-tation hypotheses is done in most publications about
+1) Accuracy: Showing the correctness of the segmentation
+hypotheses is done in most publications about
semantic segmentation. However, there are a couple
of different ways how this accuracy can be displayed.
One way to give readers a first qualitative impression
@@ -213,20 +212,14 @@
One way to compare segmentation algorithms is
by the pixel-wise accuracy of the predicted
segmentation as done in many publications [SWRC06], [CP08],
-[LSD14]. This is also called per-pixel rate and de-
-fined as
-
-∑k
-i=1 nii∑k
-i=1 ti
-
-. Taking the pixel-wise classification
+[LSD14]. This is also called per-pixel rate and defined
+as ∑^k_{i=1} n_{ii} / ∑^k_{i=1} t_i. Taking the pixel-wise classification
accuracy has two major drawbacks:

P1 Tasks like segmenting images for autonomous cars
have large regions which have one class. This
makes achieving classification accuracies of more
-than 30 % with a priori knowledge only possible.
+than 30% with a priori knowledge only possible.
For example, a system might learn that a certain
position of the image is most of the time “sky”
while another position is most of the time “road”.
@@ -240,45 +233,10 @@ car”
Three accuracy metrics which do not suffer from
problem P1 are used in [LSD14]:
-• mean accuracy: 1k ·
-
-∑k
-i=1
-
-nii
-ti
-∈ [0, 1]
-
-• mean intersection over union:
-1
-k ·
-
-∑k
-i=1
-
-nii
-ti−nii+
-
-∑k
-j=1 nji
-
-∈ [0, 1]
+• mean accuracy: 1/k · ∑^k_{i=1} n_{ii}/t_i ∈ [0, 1]
+• mean intersection over union: 1/k · ∑^k_{i=1} n_{ii}/(t_i − n_{ii} + ∑^k_{j=1} n_{ji}) ∈ [0, 1]
• frequency weighted intersection over union:
-
-(
-∑k
-i=1 ti)
-
-−1 ∑k
-i=1 ti ·
-
-nii
-ti−nii+
-
-∑k
-j=1 nji
-
-∈ [0, 1]
+(∑^k_{i=1} t_i)^{−1} · ∑^k_{i=1} t_i · n_{ii}/(t_i − n_{ii} + ∑^k_{j=1} n_{ji}) ∈ [0, 1]
Another problem might be pixels which cannot be
assigned to one of the known classes. For this reason,
[SWRC06] makes use of a void class. This class gets
@@ -291,8 +249,8 @@
is giving the confusion matrix as done in [SWRC06].
However, this approach is not feasible if many classes
are given.
-The F-measure is useful for binary classifica-
-tion task such as the KITTI road segmentation
+The F-measure is useful for binary classification
+task such as the KITTI road segmentation
benchmark [FKG13] or crypt segmentation as done
by [CRSS14]. It is calculated as “the harmonic mean of
the precision and recall” [PH05]:
@@ -309,12 +267,12 @@
Finally, it should be noted that a lot of other measures
for the accuracy of segmentations were proposed for
non-semantic segmentation. One of those accuracy
measures is Normalized Probabilistic Rand (NPR)
-index which was introduced in [UPH05] and eval-
-uated in [CSI+09] on dermoscopy images. Other
+index which was introduced in [UPH05] and evaluated
+in [CSI+09] on dermoscopy images. Other
non-semantic segmentation measures were introduced
in [MFTM01], but the reason for creating them seems to
-be to deal with the under-defined task description of non-
-semantic segmentation. These accuracy measures try to
+be to deal with the under-defined task description of non-semantic
+segmentation. These accuracy measures try to
deal with different levels of coarsity of the segmentation.
This is much less of a problem in semantic segmentation
and thus those measures are not explained here.
@@ -334,8 +292,8 @@
very hardware, implementation and in some cases even
data specific. For example, [HJBJ+96] notes that their
algorithm needs 10 s on a Sun SparcStation 20.
The fastest CPU ever produced for this system had 200 MHz.
-Comparing this directly with results which were ob-
-tained using an Intel i7-4820K with 3.9 GHz would not
+Comparing this directly with results which were obtained
+using an Intel i7-4820K with 3.9 GHz would not
be meaningful.

However, it does still make sense to mention the
@@ -421,22 +379,22 @@ the object boundaries” [SWRC06].
3) Medical Databases: The Warwick-QU Dataset
consists of 165 images with pixel-level annotation of
-5 classes: “healthy, adenomatous, moderately differen-
-tiated, moderately-to-poorly differentiated, and poorly
+5 classes: “healthy, adenomatous, moderately differentiated,
+moderately-to-poorly differentiated, and poorly
differentiated” [CSM09]. This dataset is part of the
Gland Segmentation (GlaS) challenge.

-The DIARETDB1 [KKV+14] is a dataset of 89 im-
-ages fundus images. Those images show the interior
+The DIARETDB1 [KKV+14] is a dataset of 89 images
+fundus images. Those images show the interior
surface of the eye. Fundus images can be used to detect
diabetic retinopathy. The images have four classes of
coarse annotations: hard and soft exudates, hemorrhages
and red small dots.

-20 test and additionally 20 training retinal fun-
-dus images are available through the DRIVE data
-set [SAN+04]. The vessels were annotated. Addition-
-ally, [AP11] added vascular features.
+20 test and additionally 20 training retinal fundus
+images are available through the DRIVE data
+set [SAN+04]. The vessels were annotated. Additionally,
+[AP11] added vascular features.

The Open-CAS Endoscopic Datasets [MHMK+14]
are 60 images taken from laparoscopic adrenalectomies
@@ -450,22 +408,7 @@
One crowd annotation was obtained for each image
by a majority vote on a pixel basis of 10 segmentations
given by 10 different KWs.

-Training
-Prediction
-
-Post-
-processing
-
-Window-wise
-Classification
-
-Window
-extraction
-
-Data
-augmentationFeature extraction
-
-Preprocessing
+[IMAGE]

Figure 2: A typical segmentation pipeline gets raw
pixel data, applies preprocessing techniques
@@ -487,14 +430,14 @@
classifier which operates on fixed-size feature inputs
and a sliding-window approach [DT05], [YBCK10],
[SCZ08]. This means a classifier is trained on images
of a fixed size. The trained classifier is then fed with
-rectangular regions of the image which are called win-
-dows. Although the classifier gets an image patch of e.g.
-51 px×51 px of the environment, it might only classify
+rectangular regions of the image which are called windows.
+Although the classifier gets an image patch of e.g.
+51 px × 51 px of the environment, it might only classify
the center pixel or a subset of the complete window.
This segmentation pipeline is visualized in Figure 2.

-This approach was taken by [BKTT15] and a major-
-ity of the VOC2007 participants [EVGW+a]. As this
+This approach was taken by [BKTT15] and a majority
+of the VOC2007 participants [EVGW+a]. As this
approach has to apply the patch classifier 512 · 512 =
262 144 times for images of size 512 px×512 px, there
are techniques for speeding it up such as applying a
@@ -510,8 +453,6 @@
Conditional Random Fields (CRFs) which take the
information of the complete image and segment it in
an holistic approach.

-http://host.robots.ox.ac.uk:8080/
-
V. TRADITIONAL APPROACHES

Image segmentation algorithms which use traditional
@@ -526,8 +467,8 @@
Fields in Section V-E and Support Vector Machines
(SVMs) in Section V-D. Postprocessing is covered in
Section V-G.
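The window-wise classification in this pipeline can be sketched in a few lines. This is a naive illustration, not the implementation of [BKTT15]; `classify_patch` stands in for any trained fixed-size classifier that returns the class of the center pixel:

```python
import numpy as np

def sliding_window_segmentation(image, classify_patch, patch_size=51):
    """Classify every pixel from the window centered on it (naive sketch)."""
    r = patch_size // 2
    h, w = image.shape[:2]
    # Reflect-pad so that windows at the image border are well-defined.
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    labels = np.empty((h, w), dtype=np.int64)
    for y in range(h):        # one classifier call per pixel, i.e.
        for x in range(w):    # 512 * 512 = 262144 calls for a 512 px image
            window = padded[y:y + patch_size, x:x + patch_size]
            labels[y, x] = classify_patch(window)
    return labels
```

The quadratic number of classifier calls is exactly why the speed-up techniques mentioned above matter in practice.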
-It should be noted that algorithms can use combina- -tion of methods. For example, [TNL14] makes use of a +It should be noted that algorithms can use combination +of methods. For example, [TNL14] makes use of a combination of a SVM and a MRF. Also, auto-encoders can be used to learn features which in turn can be used by any classifier. @@ -576,9 +517,9 @@ were proposed in [DT05] and are used in [BMBM10], 3) SIFT: Scale-invariant feature transform (SIFT) feature descriptors describe keypoints in an image. The -image patch of the size 16× 16 around the keypoint +image patch of the size 16×16 around the keypoint is taken. This patch is divided in 16 distinct parts of -the size 4× 4. For each of those parts a histogram of +the size 4×4. For each of those parts a histogram of 8 orientations is calculated similar as for HOG features. This results in a 128-dimensional feature vector for each keypoint. @@ -606,23 +547,23 @@ classes like humans. However, it is difficult for classes like airplanes, ships, organs or cells where the human annotators do not know the keypoints. Additionally, the keypoints have to be chosen for every single class. There -are strategies to deal with those problems like viewpoint- -dependent keypoints. Poselets were used in [BMBM10] +are strategies to deal with those problems like viewpoint-dependent +keypoints. Poselets were used in [BMBM10] to detect people and in [BBMM11] for general object detection of the PASCAL VOC dataset. 6) Textons: A texton is the minimal building block of vision. The computer vision literature does not give a strict definition for textons, but edge detectors could be -one example. One might argue that deep learning tech- -niques with Convolution Neuronal Networks (CNNs) +one example. One might argue that deep learning techniques +with Convolution Neuronal Networks (CNNs) learn textons in the first filters. An excellent explanation of textons can be found in [ZGWX05]. -7) Dimensionality Reduction: High-resolution im- -ages have a lot of pixels. Having one or more feature per +7) Dimensionality Reduction: High-resolution images +have a lot of pixels. Having one or more feature per pixel results in well over a million features. This makes training difficult while the higher resolution might not contain much more information. A simple approach @@ -662,8 +603,8 @@ directly be applied on the pixels, when one gives a feature vector per pixel. Two clustering algorithms are k-means and the mean-shift algorithm. -The k-means algorithm is a general-purpose cluster- -ing algorithm which requires the number of clusters to +The k-means algorithm is a general-purpose clustering +algorithm which requires the number of clusters to be given beforehand. Initially, it places the k centroids randomly in the feature space. Then it assigns each data point to the nearest centroid, moves the centroid @@ -673,10 +614,9 @@ described in [Har75]. k-means was applied by [CLP98] for medical image segmentation. -Another clustering algorithm is the mean-shift algo- - -rithm which was introduced by [CM02] for segmen- -tation tasks. The algorithm finds the cluster centers +Another clustering algorithm is the mean-shift algorithm +which was introduced by [CM02] for segmentation +tasks. The algorithm finds the cluster centers by initializing centroids at random seed points and iteratively shifting them to the mean coordinate within a certain range. 
Instead of taking a hard range constraint, @@ -692,8 +632,8 @@ as vertices and an edge weight is a measure of dissimilarity such as the difference in color [FH04], [Fel]. There are several different candidates for edges. -The 4-neighborhood (north, east, south west) or an 8- -neighborhood (north, north-east, east, south-east, south, +The 4-neighborhood (north, east, south west) or an 8-neighborhood +(north, north-east, east, south-east, south, south-west, west, north-west) are plausible choices. One way to cut the edges is by building a minimum spanning tree and removing edges above a threshold. @@ -703,8 +643,8 @@ step, the connected components are the segments. A graph-based method which ranked 2nd in the Pascal VOC 2010 challenge [EVGW+10] is described -in [CS10]. The system makes heavy use of the multi- -cue contour detector globalPb [MAFM08] and needs +in [CS10]. The system makes heavy use of the multi-cue +contour detector globalPb [MAFM08] and needs about 10 GB of main memory [CS11]. 3) Random Walks: Random walks belong to the @@ -770,8 +710,8 @@ branch to descend. Each leaf is a class. One strength of Random Decision Forests compared to many other classifiers like SVMs and neural networks is that the scale of measure of the features (nominal, -ordinal, interval, ratio) can be arbitrary. Another advan- -tage of Random Decision Forests compared to SVMs, +ordinal, interval, ratio) can be arbitrary. Another advantage +of Random Decision Forests compared to SVMs, for example, is the speed of training and classification. Decision trees were extensively studied in the past @@ -794,11 +734,11 @@ according to an error function. Random Decision Forests with texton features (see Section V-A6) are applied in [SJC08] for segmentation. In the [MSC] dataset, they report a per-pixel accuracy -rate of 66.9 % for their best system. This system +rate of 66.9% for their best system. This system requires 415 ms for the segmentation of 320 px×213 px images on a single 2.7 GHz core. On the Pascal VOC 2007 dataset, they report an average per-pixel -accuracy for their best segmentation system of 42 %. +accuracy for their best segmentation system of 42%. An excellent introduction to Random Decision Forests for semantic segmentation is given by [SCZ08]. @@ -807,9 +747,9 @@ D. SVMs SVMs are well-studied binary classifiers which can be described by five central ideas. For those ideas, the -training data is represented as (xi, yi) where xi is the -feature vector and yi ∈ { −1, 1 } the binary label for -training example i ∈ { 1, . . . ,m }. +training data is represented as (x_i, y_i) where x_i is the +feature vector and yi ∈ {−1, 1} the binary label for +training example i ∈ {1, ... , m}. 1) If data is linearly separable, it can be separated by a hyperplane. There is one hyperplane which @@ -817,22 +757,13 @@ maximizes the distance to the next datapoints (support vectors). This hyperplane should be taken: minimize -w,b - -1 - -2 -‖w‖2 - -s.t. ∀mi=1yi · (〈w,xi〉+ b)︸ ︷︷ ︸ -sgn applied to this gives the classification - -≥ 1 +w,b1 2 ‖w‖2 s.t. ∀mi=1yi · (〈w,xi〉+ b)︸ ︷︷ ︸ +sgn applied to this gives the classification ≥ 1 2) Even if the underlying process which generates the features for the two classes is linearly separable, -noise can make the data not separable. The intro- -duction of slack variables to relax the requirement +noise can make the data not separable. The introduction +of slack variables to relax the requirement of linear separability solves this problem. 
The trade-off between accepting some errors and a more complex model is weighted by a parameter @@ -840,19 +771,7 @@ C ∈ R+0 . The bigger C, the more errors are accepted. The new optimization problem is: minimize -w - -1 - -2 -‖w‖2 + C · - -m∑ -i=1 - -ξi - -s.t. ∀mi=1yi · (〈w,xi〉+ b) ≥ 1− ξi +w1 2 ‖w‖2 + C · m∑ i=1 ξi s.t. ∀mi=1yi · (〈w,xi〉+ b) ≥ 1− ξi Note that 0 ≤ ξi ≤ 1 means that the data point is within the margin, whereas ξi ≥ 1 means it is @@ -863,12 +782,7 @@ a soft-margin SVM. w and the bias b. The dual problem is to express w as a linear combination of the training data xi: -w = - -m∑ -i=1 - -αiyixi +w = m∑ i=1 αiyixi where yi ∈ { −1, 1 } represents the class of the training example and αi are Lagrange multipliers. @@ -886,14 +800,7 @@ maximize αi m∑ -i=1 - -αi − -1 - -2 - -m∑ +i=1 αi −1 2 m∑ i=1 m∑ @@ -909,8 +816,7 @@ i=1 αiyi = 0 -4) Not every dataset is linearly separable. This prob- -lem is approached by transforming the feature +4) Not every dataset is linearly separable. This problem is approached by transforming the feature vectors x with a non-linear mapping Φ into a higher dimensional (probably ∞-dimensional) space. As the feature vectors x are only used @@ -1174,13 +1080,12 @@ sigmoid activation functions e−x + 1 -Krizhevsky et al. implemented those ideas and partici- -pated in the ImageNet Large-Scale Visual Recognition +Krizhevsky et al. implemented those ideas and participated +in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). The best other system, which -used SIFT features and Fisher Vectors, had a perfor- -mance of about 25.7 % while the network by Alex -Krizhevsky et al. got 17.0 % error rate on the ILSVRC- -2010 dataset. As a preprocessing step, they downsam- +used SIFT features and Fisher Vectors, had a performanceof about 25.7% while the network by Alex +Krizhevsky et al. got 17.0% error rate on the ILSVRC-2010 +dataset. As a preprocessing step, they downsam- pled all images to a fixed size of 256 px×256 px before they fed the features into their network. This network is commonly known as AlexNet. @@ -1214,8 +1119,8 @@ which should be tested. Those cases might not occur often in the training data, but it could still happen in the productive system. -I am not aware of any systematic work which exam- -ined the influence of problems such as the following. +I am not aware of any systematic work which examined +the influence of problems such as the following. A. Lens Flare @@ -1424,27 +1329,24 @@ user/gustavor/chen_isbi_11.pdf [CP08] G. Csurka and F. Perronnin, “A simple high performance approach to semantic segmentation.” -in BMVC, 2008, pp. 1–10. [Online]. Avail- -able: http://www.xrce.xerox.com/layout/set/print/ +in BMVC, 2008, pp. 1–10. [Online]. Available: +http://www.xrce.xerox.com/layout/set/print/ content/download/16654/118653/file/2008-023.pdf [CRSS] A. Cohen, E. Rivlin, I. Shimshoni, and -E. Sabo, “Colon crypt segmentation website.” [On- -line]. Available: http://mis.haifa.ac.il/~ishimshoni/ +E. Sabo, “Colon crypt segmentation website.” [Online]. +Available: http://mis.haifa.ac.il/~ishimshoni/ SegmentCrypt/Download.htm [CRSS14] ——, “Memory based active contour algorithm using pixel-level classified images for colon crypt segmentation,” Computerized Medical Imaging and Graphics, Nov. 2014. [Online]. 
Available: -http://mis.haifa.ac.il/~ishimshoni/SegmentCrypt/ -Active%20contour%20based%20on%20pixel- -level%20classified%20image%20for%20colon% -20crypts%20segmentation.pdf +http://mis.haifa.ac.il/~ishimshoni/SegmentCrypt/Active%20contour%20based%20on%20pixel-level%20classified%20image%20for%20colon%20crypts%20segmentation.pdf [CS10] J. Carreira and C. Sminchisescu, “Constrained -parametric min-cuts for automatic object segmenta- -tion,” in Computer Vision and Pattern Recognition +parametric min-cuts for automatic object segmentation,” +in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 3241–3248. @@ -1461,8 +1363,8 @@ and Technology, vol. 15, no. 4, pp. 444–450, 2009. [Online]. Available: http://arxiv.org/abs/1009.1020 [CSM09] L. P. Coelho, A. Shariff, and R. F. Murphy, “Nuclear -segmentation in microscope cell images: a hand- -segmented dataset and comparison of algorithms,” +segmentation in microscope cell images: a hand-segmented +dataset and comparison of algorithms,” in Biomedical Imaging: From Nano to Macro, 2009. ISBI’09. IEEE International Symposium on. IEEE, 2009, pp. 518–521. [Online]. Available: @@ -1476,8 +1378,8 @@ in Computer Vision and Pattern Recognition 2012, pp. 1656–1663. [Online]. Available: http: //pages.cs.wisc.edu/~jiaxu/pub/rwcoseg.pdf -[DHS15] J. Dai, K. He, and J. Sun, “Instance-aware seman- -tic segmentation via multi-task network cascades,” +[DHS15] J. Dai, K. He, and J. Sun, “Instance-aware semantic +segmentation via multi-task network cascades,” arXiv preprint arXiv:1512.04412, 2015. [DT05] N. Dalal and B. Triggs, “Histograms of oriented @@ -1492,14 +1394,12 @@ abs_all.jsp?arnumber=1467360 [EVGW+a] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge -2007 (VOC2007) Results,” http://www.pascal- -network.org/challenges/VOC/voc2007/workshop/index.html. +2007 (VOC2007) Results,” http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html. [Online]. Available: http://host.robots.ox.ac.uk: 8080/pascal/VOC/voc2007/index.html -[EVGW+b] ——, “The PASCAL Visual Object Classes Chal- -lenge 2012 (VOC2012) Results,” http://www.pascal- -network.org/challenges/VOC/voc2012/workshop/index.html. +[EVGW+b] ——, “The PASCAL Visual Object Classes Challenge +2012 (VOC2012) Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html. [Online]. Available: http://host.robots.ox.ac.uk: 8080/pascal/VOC/voc2012/index.html @@ -1579,15 +1479,13 @@ Fisher, “An experimental comparison of range image segmentation algorithms,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 18, no. 7, pp. 673–689, Jul. 1996. -[Online]. Available: http://ieeexplore.ieee.org/xpls/ -abs_all.jsp?arnumber=506791 +[Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=506791 [Ho95] T. K. Ho, “Random decision forests,” in Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on, vol. 1. IEEE, 1995, pp. 278–282. -[Online]. Available: http://ect.bell-labs.com/who/ -tkh/publications/papers/odt.pdf +[Online]. Available: http://ect.bell-labs.com/who/tkh/publications/papers/odt.pdf [Hus07] Hustvedt, “File:cctv lens flare.jpg,” Wikipedia Commons, Nov. 2007. [Online]. Avail- @@ -1600,16 +1498,14 @@ labeling,” in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2, Jun. 2004, pp. II–695–II–702 Vol.2. -[Online]. 
Available: http://ieeexplore.ieee.org/xpl/ -login.jsp?tp=&arnumber=1315232 +[Online]. Available: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1315232 [JLD03] K. Jiang, Q.-M. Liao, and S.-Y. Dai, “A novel white blood cell segmentation scheme using scale-space filtering and watershed clustering,” in Machine Learning and Cybernetics, 2003 International Conference on, vol. 5, Nov 2003, pp. 2820–2825 -Vol.5. [Online]. Available: http://ieeexplore.ieee.org/ -xpl/login.jsp?tp=&arnumber=1260033 +Vol.5. [Online]. Available: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1260033 [Kaf07] L. Kaffer, “File:great male leopard in south afrika- jd.jpg,” Wikipedia Commons, Jul. 2007. [Online]. @@ -1650,7 +1546,6 @@ visual recognition,” 2015. [Online]. Available: http://cs231n.stanford.edu/ [Low04] D. Lowe, “Distinctive image features from scale- - invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004. [Online]. Available: http://dx.doi.org/10.1023/B% @@ -1684,14 +1579,14 @@ IEEE Conference on, June 2008, pp. 1–8. abs_all.jsp?arnumber=4587420 [Man12] M. Manske, “File:randabschattung mikroskop -kamera 6.jpg,” Wikipedia Com- -mons, Dec. 2012. [Online]. Avail- -able: https://commons.wikimedia.org/wiki/File: +kamera 6.jpg,” Wikipedia Commons, +Dec. 2012. [Online]. Available: +https://commons.wikimedia.org/wiki/File: Randabschattung_Mikroskop_Kamera_6.JPG [MBLAGJ+07] S. Maldonado-Bascon, S. Lafuente-Arroyo, P. Gil- -Jimenez, H. Gomez-Moreno, and F. Lopez- -Ferreras, “Road-sign detection and recognition +Jimenez, H. Gomez-Moreno, and F. Lopez-Ferreras, +“Road-sign detection and recognition based on support vector machines,” Intelligent Transportation Systems, IEEE Transactions on, vol. 8, no. 2, pp. 264–278, Jun. 2007. @@ -1786,8 +1681,8 @@ on, vol. 16, no. 4, pp. 1046–1057, 2007. [Online]. Available: http://ieeexplore.ieee.org/xpls/ abs_all.jsp?arnumber=4130436 -[PTN09] N. Plath, M. Toussaint, and S. Nakajima, “Multi- -class image segmentation using conditional random +[PTN09] N. Plath, M. Toussaint, and S. Nakajima, “Multi-class +image segmentation using conditional random fields and global classification,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 817–824. @@ -1804,8 +1699,8 @@ Machine learning, vol. 1, no. 1, pp. 81–106, Aug. 1986. [Online]. Available: http://dx.doi.org/ 10.1023/A%3A1022643204877 -[Qui93] ——, C4.5: Programs for Machine Learning, P. Lan- -gley, Ed. Morgan Kaufmann Publishers, Inc., 1993. +[Qui93] ——, C4.5: Programs for Machine Learning, P. Langley, +Ed. Morgan Kaufmann Publishers, Inc., 1993. [RKB04] C. Rother, V. Kolmogorov, and A. Blake, “Grabcut: Interactive foreground extraction using iterated @@ -1913,8 +1808,8 @@ Conference on. IEEE, 2005, pp. 34–34. viewcontent.cgi?article=1365&context=robotics [vdMPvdH09] L. J. van der Maaten, E. O. Postma, and H. J. -van den Herik, “Dimensionality reduction: A com- -parative review,” Journal of Machine Learning +van den Herik, “Dimensionality reduction: A comparative +review,” Journal of Machine Learning Research, vol. 10, no. 1-41, pp. 66–71, 2009. [VOC10] “Voc2010 preliminary results,” 2010. [Online]. 
diff --git a/read/extraction-ground-truth/1707.09725.txt b/read/extraction-ground-truth/1707.09725.txt index 03b7bbd..19145c0 100644 --- a/read/extraction-ground-truth/1707.09725.txt +++ b/read/extraction-ground-truth/1707.09725.txt @@ -49,8 +49,8 @@ FZI Research Center for Information Technology Affirmation -Ich versichere wahrheitsgemäß, die Arbeit selbstständig angefertigt, alle benutzten Hilfs- -mittel vollständig und genau angegeben und alles kenntlich gemacht zu haben, was aus +Ich versichere wahrheitsgemäß, die Arbeit selbstständig angefertigt, alle benutzten Hilfsmittel +vollständig und genau angegeben und alles kenntlich gemacht zu haben, was aus Arbeiten anderer unverändert oder mit Abänderungen entnommen wurde. Karlsruhe, Martin Thoma @@ -66,7 +66,7 @@ Abstract Convolutional Neural Networks (CNNs) dominate various computer vision tasks since Alex Krizhevsky showed that they can be trained effectively and reduced the top-5 error -from 26.2 % to 15.3 % on the ImageNet large scale visual recognition challenge. Many +from 26.2% to 15.3% on the ImageNet large scale visual recognition challenge. Many aspects of CNNs are examined in various publications, but literature about the analysis and construction of neural network architectures is rare. This work is one step to close this gap. A comprehensive overview over existing techniques for CNN analysis and topology @@ -86,16 +86,16 @@ Modelle welche auf Convolutional Neural Networks (CNNs) basieren sind in verschi Aufgaben der Computer Vision dominant seit Alex Krizhevsky gezeigt hat dass diese effektiv trainiert werden können und er den Top-5 Fehler in dem ImageNet large scale visual recognition challenge Benchmark von 26.2 % auf 15.3 % drücken konnte. Viele Aspekte -von CNNs wurden in verschiedenen Publikationen untersucht, aber es wurden vergleich- -sweise wenige Arbeiten über die Analyse und die Konstruktion von Neuronalen Netzen +von CNNs wurden in verschiedenen Publikationen untersucht, aber es wurden vergleichsweise +wenige Arbeiten über die Analyse und die Konstruktion von Neuronalen Netzen geschrieben. Diese Masterarbeit stellt einen Schritt dar um diese Lücke zu schließen. Eine umfassende Überblick über Analyseverfahren und Topologielernverfahren wird gegeben. Ein neues Verfahren zur Visualisierung der Klassifikationsfehler mit Konfusionsmatrizen wurde entwickelt. Basierend auf diesem Verfahren wurden hierarchische Klassifizierer eingeführt -und evaluiert. Zusätzlich wurden einige bereits in der Literatur beschriebene Beobachtun- -gen wie z.B. der positive Einfluss von kleinen Batch-Größen, Ensembles, Erhöhung der -Trainingsdatenmenge durch künstliche Transformationen (Data Augmentation) und die In- -varianzbildung durch künstliche Transformationen zur Test-Zeit (Test-time transformations) +und evaluiert. Zusätzlich wurden einige bereits in der Literatur beschriebene Beobachtungen +wie z.B. der positive Einfluss von kleinen Batch-Größen, Ensembles, Erhöhung der +Trainingsdatenmenge durch künstliche Transformationen (Data Augmentation) und die Invarianzbildung +durch künstliche Transformationen zur Test-Zeit (Test-time transformations) experimentell bestätigt. Andere Beobachtungen, wie beispielsweise der positive Einfluss gelernter Farbraumtransformationen konnten nicht bestätigt werden. 
Ein Modell welches weniger als eine Millionen Parameter nutzt und auf den Benchmark-Datensätzen Asirra, @@ -257,10 +257,9 @@ Computer vision is the academic field which aims to gain a high-level understand low-level information given by raw pixels from digital images. Robots, search engines, self-driving cars, surveillance agencies and many others have -applications which include one of the following six problems in computer vision as sub- -problems: +applications which include one of the following six problems in computer vision as subproblems: -• Classification:1 The algorithm is given an image and k possible classes. The task is +• Classification: 1 The algorithm is given an image and k possible classes. The task is to decide which of the k classes the image belongs to. For example, an image from a self-driving cars on-board camera contains either paved road, unpaved road or no road: Which of those given three classes is in the image? @@ -321,7 +320,7 @@ transition layers in Section 2.4 and nine ways to analyze CNNs are described in A linear image filter (also called a filter bank or a kernel) is an element F ∈ Rkw×kh×d, where kw represents the filter’s width, kh the filter’s height and d the number of input -channels. The filter F is convolved with the image I ∈ Rw×h×d to produce a new image I ′. +channels. The filter F is convolved with the image I ∈ Rw×h×d to produce a new image I′. The output image I ′ has only one channel. Each pixel I ′(x, y) of the output image gets calculated by point-wise multiplication of one filter element with one element of the original image I: @@ -361,7 +360,7 @@ output image, k2 multiplications and k2 additions of the products have to be cal One important detail is how boundaries are treated. There are four common ways of boundary treatment: -• don’t compute: The image I ′ will be smaller than the original image. I ′ ∈ +• don’t compute: The image I′ will be smaller than the original image. I′ ∈ R(w−kw+1)×(h−kh+1)×d3 , to be exact. • zero padding: The image I is padded by zeros where the filter would access elements which do not exist. This will result in edges being detected at the border if the border @@ -699,8 +698,8 @@ where ⊙ is the Hadamard product (A⊙B)i,j := (A)i,j(B)i,j Hence every value of the input gets set to zero with a dropout probability of p. Typically, -Dropout is used with p = 0.5. Layers closer to the input usually have a lower dropout prob- -ability than later layers. In order to keep the expected output at the same value, the +Dropout is used with p = 0.5. Layers closer to the input usually have a lower dropout probability +than later layers. In order to keep the expected output at the same value, the output of a dropout layer is multiplied with 1 1−p when dropout is enabled [Las17, tf-16b]. @@ -712,8 +711,8 @@ layers as it usually increases the test error as pointed out in [GG16]. Models which use Dropout can be interpreted as an ensemble of models with different numbers of neurons in each layer, but also with weight sharing. -Conceptually similar are DropConnect and networks with stochastic depth. DropCon- -nect [WZZ+13] is a generalization of Dropout, which sets weights to zero in contrast to +Conceptually similar are DropConnect and networks with stochastic depth. DropConnect +[WZZ+13] is a generalization of Dropout, which sets weights to zero in contrast to setting the output of a neuron to zero. Networks with stochastic depth as introduced in [HSL+16] dropout only complete layers. 
This can be done by having Residual networks which have one identity connection and one residual feature connection. Hence the residual @@ -906,8 +905,8 @@ but dense blocks have L(L+1) 2 connections between layers. The input feature maps are -concatenated in depth. According to the authors, this prevents features from being re- -learned and allows much fewer filters per convolutional layer. Where AlexNet and VGG-16 +concatenated in depth. According to the authors, this prevents features from being re-learned +and allows much fewer filters per convolutional layer. Where AlexNet and VGG-16 have several hundred filters per convolutional layer (see Tables D.2 and D.3), the authors used only on the order of 12 feature maps per layer. @@ -1106,8 +1105,8 @@ training data and loses its capability to generalize. At this point the quality the training set and the validation set diverge. While the classifier is still improving on the training set, it gets worse on the validation and the test set. -When the epoch-loss validation curve has plateaus as in Figure 2.8, this means the opti- -mization process did not improve for several epochs. Three possible ways to reduce the +When the epoch-loss validation curve has plateaus as in Figure 2.8, this means the optimization +process did not improve for several epochs. Three possible ways to reduce the problem of plateaus are (i) to change weight initialization if the plateau was at the beginning, (ii) regularizing the model or (iii) changing the optimization algorithm. @@ -1180,8 +1179,8 @@ The optimization process might also be stuck in a local minimum. • Loss being NAN might be due to too high learning rates. Another reason is division by zero or taking the logarithm of zero. In both cases, adding a small constant like 10−7 fixes the problem. -• If the loss-epoch validation curve has a plateau at the beginning, the weight initializa- -tion might be bad. +• If the loss-epoch validation curve has a plateau at the beginning, the weight initialization +might be bad. 18 @@ -1500,8 +1499,8 @@ the necessary number of input nodes and the number of output nodes which are det by the application and the features of the input. They then apply a criterion to insert new layers / neurons into the network. -In the following, Cascade-Correlation, Meiosis Networks and Automatic Structure Opti- -mization are introduced. +In the following, Cascade-Correlation, Meiosis Networks and Automatic Structure Optimization +are introduced. 3.1.1. Cascade-Correlation @@ -1515,14 +1514,10 @@ defined by the problem. Create a minimal, fully connected network for those. 2. Training: Train the network until the error no longer decreases. -3. Candidate Generation: Generate candidate nodes. Each candidate node is con- -nected to all inputs. They are not connected to other candidate nodes and not +3. Candidate Generation: Generate candidate nodes. Each candidate node is connected +to all inputs. They are not connected to other candidate nodes and not connected to the output nodes. -27 - - - 3. Topology Learning 4. Correlation Maximization: Train the weights of the candidates by maximizing S, @@ -1601,8 +1596,8 @@ layers or add skip connections. 3.1.3. Automatic Structure Optimization -Automatic Structure Optimization (ASO) was introduced in [BM93] for the task of on- -line handwriting recognition. It makes use of the confusion matrix C = (cij) ∈ Nk×k≥0 +Automatic Structure Optimization (ASO) was introduced in [BM93] for the task of online +handwriting recognition. 
It makes use of the confusion matrix C = (cij) ∈ Nk×k≥0 (see Section 2.5.2) to guide the topology learning. They define a confusion-symmetry matrix S with sij = sji = cij · cji. The maximum of S defines where the ASO algorithm adds more parameters. The details how the resources are added are not transferable to CNNs. @@ -1666,13 +1661,13 @@ algorithm achieves only 23.9 % accuracy [VH13]. Kocmánek shows in [Koc15] that HyperNEAT approaches can achieve 96.47 % accuracy on MNIST. Kocmánek mentions that HyperNEAT becomes slower with each hidden layer -so that not more than three hidden layers could be trained. At the same time, VGG- -19 [SZ14] already has 19 hidden layers and ResNets are successfully trained with 1202 layers +so that not more than three hidden layers could be trained. At the same time, VGG-19 +[SZ14] already has 19 hidden layers and ResNets are successfully trained with 1202 layers in [HZRS15a]. [LX17] shows that Genetic algorithms can achieve competitive results on MNIST and -SVHN, but the best results on CIFAR-10 were 7.10 % error whereas the state of the art is -at 3.74 % [HLW16]. Similarly, the Genetic algorithm achieves 29.03 % error on CIFAR-100, +SVHN, but the best results on CIFAR-10 were 7.10% error whereas the state of the art is +at 3.74 % [HLW16]. Similarly, the Genetic algorithm achieves 29.03% error on CIFAR-100, but the state of the art is 17.18 % [HLW16]. 3.4. Reinforcement Learning @@ -2203,9 +2198,9 @@ values are, the less information is lost if the filters are replaced by smaller 5. Experimental Evaluation -Figure 5.2.: Violin plots of the distribution of filter weights of a baseline model trained on CIFAR- -100. The weights of the first layer are relatively evenly spread in the interval [−0.4,+0.4]. -With every layer the interval which contains 95 % of the weights and is centered around +Figure 5.2.: Violin plots of the distribution of filter weights of a baseline model trained on CIFAR-100. +The weights of the first layer are relatively evenly spread in the interval [−0.4,+0.4]. +With every layer the interval which contains 95% of the weights and is centered around the mean becomes smaller, especially with layer 11 where the feature maps are of size 1× 1. In contrast to the other layers, the last convolutional layer has a bimodal distribution. @@ -2524,8 +2519,8 @@ wardrobe + dinosaur + lizard + snake, worm + turtle 9 crocodile, lizard, lobster, cater- -pillar + dinosaur + snake + tur- -tle, crab +pillar + dinosaur + snake + turtle, +crab 6 @@ -2585,9 +2580,9 @@ be due to limited training data, overfitting or the small size of 32 px× 32 px The experiment also shows that most of the errors are due to not identifying the correct cluster. Hence, in this case, more work in improving the root classifier is necessary rather than improving the discrimination of classes within a cluster. -Although the classes within a cluster capture most of the classifications, many misclassifica- -tions happen outside of the clusters. For example, in cluster 3, a perfect leaf classifier would -push the accuracy in the full column only to 63.50 % due to errors of the root classifier +Although the classes within a cluster capture most of the classifications, many misclassifications +happen outside of the clusters. For example, in cluster 3, a perfect leaf classifier would +push the accuracy in the full column only to 63.50% due to errors of the root classifier where the root classifier does not predict the correct cluster. 
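The root/leaf setup evaluated here can be summarized in a short sketch (all names are placeholders, not from the thesis code):

```python
def predict_hierarchical(x, root, leaves, clusters):
    """Two-level classification: pick a cluster, then a class inside it.

    root        -- classifier over the clusters found in the confusion matrix
    leaves[c]   -- classifier over only the classes of cluster c
    clusters[c] -- maps the local output of leaves[c] to global class IDs

    A wrong cluster prediction cannot be corrected by any leaf classifier,
    which is why even a perfect leaf pushes the overall accuracy only up to
    the fraction of examples whose cluster the root predicts correctly.
    """
    c = root.predict(x)           # step 1: which cluster of classes?
    local = leaves[c].predict(x)  # step 2: class within that cluster
    return clusters[c][local]     # global class ID
```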
The leaf classifiers use the same topology as the root classifier. By initializing them with the root classifiers weights their performance can be pushed at about the inner accuracy. @@ -2919,12 +2914,12 @@ of the Batch Normalization layers did not noticeably change. 5.11. Learned Color Space Transformation -In [MSM16] it is described that placing one convolutional layer with 10 filters of size 1× 1 -directly after the input and then another convolutional layer with 3 filters of size 1× 1 acts +In [MSM16] it is described that placing one convolutional layer with 10 filters of size 1×1 +directly after the input and then another convolutional layer with 3 filters of size 1×1 acts as a learned transformation in another color space and boosts the accuracy. -This approach was evaluated on CIFAR-100 by adding a convolutional layer with ELU ac- -tivation and 10 filters followed by another convolutional layer with ELU activation and +This approach was evaluated on CIFAR-100 by adding a convolutional layer with ELU activation +and 10 filters followed by another convolutional layer with ELU activation and 3 filters. The mean accuracy of 10 models was 63.31 % with a standard deviation of 1.37. The standard deviation is noticeable higher than the standard deviation of the baseline model (0.55) and the accuracy also decreased by 0.07 percentage points. The accuracy of @@ -2938,11 +2933,11 @@ Hence it is not advisable to use the learned color space transformation. 5.12. Pooling -An alternative to max pooling with stride 2 with a 2× 2 kernel is using a 3× 3 kernel with +An alternative to max pooling with stride 2 with a 2×2 kernel is using a 3×3 kernel with stride 2. This approach was evaluated on CIFAR-100 by replacing all max pooling layers with the -3× 3 kernel max pooling (and SAME padding). The mean accuracy of 10 models was 63.32 % +3×3 kernel max pooling (and SAME padding). The mean accuracy of 10 models was 63.32 % (−0.06) and the standard deviation was 0.57 (+0.02). The ensemble achieved 65.15 % test accuracy (+0.45). @@ -2970,8 +2965,8 @@ other comparisons of eleven activation functions are given in Table B.3. Theoretical explanations why one activation function is preferable to another in some scenarios are the following: -• Vanishing Gradient: Activation functions like tanh and the logistic function sat- -urate outside of the interval [−5, 5]. This means weight updates are very small for +• Vanishing Gradient: Activation functions like tanh and the logistic function saturate +outside of the interval [−5, 5]. This means weight updates are very small for preceding neurons, which is especially a problem for very deep or recurrent networks as described in [BSF94]. Even if the neurons learn eventually, learning is slower [KSH12]. @@ -2995,7 +2990,7 @@ As expected, PReLU and ELU performed best. Unexpected was that the logistic func tanh and softplus performed worse than the identity and it is unclear why the pure-softmax network performed so much better than the logistic function. One hypothesis why the logistic function performs so bad is that it cannot produce negative outputs. Hence the -logistic− function was developed: +logistic−function was developed: logistic−(x) = 1 @@ -3487,10 +3482,10 @@ algorithm in Chapter 4 and evaluated in Sections 4.2 and 5.4. The important insi • Ordering the classes in the confusion matrix allows to display the relevant parts even for several hundred classes. -• A hierarchy of classifiers based on the classes does not improve the results on CIFAR- -100. 
There are three possible reasons for this: +• A hierarchy of classifiers based on the classes does not improve the results on CIFAR-100. + There are three possible reasons for this: -– 32 px× 32 px is too low dimensional +– 32 px × 32 px is too low dimensional – 100 classes are not enough for this approach @@ -4247,8 +4242,8 @@ used. D. Common Architectures -In the following, some of the most important CNN architectures are explained. Understand- -ing the development of these architectures helps understanding critical insights the machine +In the following, some of the most important CNN architectures are explained. Understanding +the development of these architectures helps understanding critical insights the machine learning community got in the past years for convolutional networks for image recognition. It starts with LeNet-5 from 1998, continues with AlexNet from 2012, VGG-16 D from @@ -4303,8 +4298,8 @@ than fully connected layers. D.2. AlexNet The first CNN which achieved major improvements on the ImageNet dataset was AlexNet [KSH12]. -Its architecture is shown in Figure D.2 and described in Table D.2. It has about 60·106 param- -eters. A trained AlexNet can be downloaded at www.cs.toronto.edu/g̃uerzhoy/tf_alexnet. +Its architecture is shown in Figure D.2 and described in Table D.2. It has about 60·106 parameters. +A trained AlexNet can be downloaded at www.cs.toronto.edu/g̃uerzhoy/tf_alexnet. Note that the uncompressed size is at least 60 965 224 floats · 32 bit float @@ -4777,9 +4772,8 @@ gradient descent,” in Advances in Neural Information Processing Systems 29 learning-to-learn-by-gradient-descent-by-gradient-descent.pdf [AM15] M. T. Alexander Mordvintsev, Christopher Olah, “Inceptionism: -Going deeper into neural networks,” Jun. 2015. [Online]. Avail- -able: https://research.googleblog.com/2015/06/inceptionism-going-deeper- -into-neural.html +Going deeper into neural networks,” Jun. 2015. [Online]. Available: +https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html [Asi17] “Kaggle cats and dogs dataset,” Oct. 2017. [Online]. Available: https: //www.microsoft.com/en-us/download/details.aspx?id=54765 @@ -4801,21 +4795,6 @@ Université de Montréal, Tech. Rep. 1337, 2009. reinforcement learning,” arXiv preprint arXiv:1611.02167, Nov. 2016. [Online]. Available: https://arxiv.org/abs/1611.02167 -103 - -https://arxiv.org/abs/1603.04467 -http://papers.nips.cc/paper/6461-learning-to-learn-by-gradient-descent-by-gradient-descent.pdf -http://papers.nips.cc/paper/6461-learning-to-learn-by-gradient-descent-by-gradient-descent.pdf -https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html -https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html -https://www.microsoft.com/en-us/download/details.aspx?id=54765 -https://www.microsoft.com/en-us/download/details.aspx?id=54765 -http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf -http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf -https://arxiv.org/abs/1703.10155 -https://arxiv.org/abs/1611.02167 - - [BM93] U. Bodenhausen and S. Manke, Automatically Structured Neural Networks For Handwritten Character And Word Recognition. London: Springer London, Sep. 1993, pp. 956–961. [Online]. Available: http: @@ -4863,28 +4842,11 @@ preprint arXiv:1511.07289, Nov. 2015. [Online]. Available: https: learning,” arXiv preprint arXiv:1410.0759, Oct. 2014. [Online]. 
Available: https://arxiv.org/abs/1410.0759 -104 - -http://dx.doi.org/10.1007/978-1-4471-2063-6_283 -http://dx.doi.org/10.1007/978-1-4471-2063-6_283 -http://yann.lecun.com/exdb/publis/pdf/boureau-icml-10.pdf -http://ieeexplore.ieee.org/document/143326/ -https://github.com/fchollet/keras -http://cs.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf -http://cs.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf -http://cs.stanford.edu/~acoates/stl10 -https://arxiv.org/abs/1202.2745v1 -https://arxiv.org/abs/1511.07289 -https://arxiv.org/abs/1511.07289 -https://arxiv.org/abs/1410.0759 - - [DBB+01] C. Dugas, Y. Bengio et al., “Incorporating second-order functional -knowledge for better option pricing,” in Advances in Neural Infor- -mation Processing Systems 13 (NIPS), T. K. Leen, T. G. Dietterich, +knowledge for better option pricing,” in Advances in Neural Information +Processing Systems 13 (NIPS), T. K. Leen, T. G. Dietterich, and V. Tresp, Eds. MIT Press, 2001, pp. 472–478. [Online]. -Available: http://papers.nips.cc/paper/1920-incorporating-second-order- -functional-knowledge-for-better-option-pricing.pdf +Available: http://papers.nips.cc/paper/1920-incorporating-second-order-functional-knowledge-for-better-option-pricing.pdf [DDFK16] S. Dieleman, J. De Fauw, and K. Kavukcuoglu, “Exploiting cyclic symmetry in convolutional neural networks,” arXiv preprint arXiv:1602.02660, Feb. @@ -4923,23 +4885,7 @@ royal astronomical society, vol. 450, no. 2, pp. 1441–1459, 2015. exploits interest-aligned manual image categorization,” in ACM Con- ference on Computer and Communications Security (CCS), no. 14. Association for Computing Machinery, Inc., Oct. 2007. [Online]. - -105 - -http://papers.nips.cc/paper/1920-incorporating-second-order-functional-knowledge-for-better-option-pricing.pdf -http://papers.nips.cc/paper/1920-incorporating-second-order-functional-knowledge-for-better-option-pricing.pdf -https://arxiv.org/abs/1602.02660 -http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf -https://arxiv.org/abs/1512.04412 -ftp://ftp.icsi.berkeley.edu/pub/ai/jagota/vol2_6.pdf -http://cs229.stanford.edu/proj2015/054_report.pdf -http://cs229.stanford.edu/proj2015/054_report.pdf -http://papers.nips.cc/paper/5548-discriminative-unsupervised-feature-learning-with-convolutional-neural-networks.pdf -http://papers.nips.cc/paper/5548-discriminative-unsupervised-feature-learning-with-convolutional-neural-networks.pdf - - -Available: https://www.microsoft.com/en-us/research/publication/asirra-a- -captcha-that-exploits-interest-aligned-manual-image-categorization/ +Available: https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/ [EKS+96] M. Ester, H.-P. Kriegel et al., “A density-based algorithm for discovering clusters in large spatial databases with noise.” in Kdd, vol. 96, no. 34, 1996, @@ -4961,8 +4907,8 @@ vol. 28, no. 4, pp. 594–611, Apr. 2006. [Online]. Available: http: [FFP03] R. F. Fei-Fei and P. Perona, “Caltech 101,” 2003. [Online]. Available: http: //www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html -[FGMR10] P. F. Felzenszwalb, R. B. Girshick et al., “Object detection with discrimina- -tively trained part-based models,” IEEE transactions on pattern analysis and +[FGMR10] P. F. Felzenszwalb, R. B. Girshick et al., “Object detection with discriminatively +trained part-based models,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010. [FL89] S. E. 
Fahlman and C. Lebiere, “The cascade-correlation learning architecture,” @@ -5050,22 +4996,6 @@ preprint arXiv:1611.04231, Nov. 2016. [Online]. Available: https: based image classification,” arXiv preprint arXiv:1312.5402, Dec. 2013. [Online]. Available: https://arxiv.org/abs/1312.5402 -107 - -https://arxiv.org/abs/1506.02158v6 -https://arxiv.org/abs/1412.6071 -http://www.vision.caltech.edu/Image_Datasets/Caltech256/ -http://www.jmlr.org/proceedings/papers/v28/goodfellow13.pdf -http://www.jmlr.org/proceedings/papers/v28/goodfellow13.pdf -https://arxiv.org/abs/1608.08614 -http://papers.nips.cc/paper/227-meiosis-networks.pdf -https://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/ -https://arxiv.org/abs/1608.06993v1 -https://arxiv.org/abs/1611.04231 -https://arxiv.org/abs/1611.04231 -https://arxiv.org/abs/1312.5402 - - [HPK11] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011. @@ -5112,23 +5042,6 @@ https://arxiv.org/abs/1502.01852 [Ima12] “Imagenet large scale visual recognition challenge 2012 (ILSVRC2012),” -108 - -https://arxiv.org/abs/1607.04381 -http://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network.pdf -http://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network.pdf -https://arxiv.org/abs/1207.0580 -https://arxiv.org/abs/1603.09382 -https://arxiv.org/abs/1603.09382 -http://ee.caltech.edu/Babak/pubs/conferences/00298572.pdf -http://ee.caltech.edu/Babak/pubs/conferences/00298572.pdf -https://arxiv.org/abs/1503.02531 -https://arxiv.org/abs/1406.4729 -https://arxiv.org/abs/1512.03385v1 -https://arxiv.org/abs/1512.03385v1 -https://arxiv.org/abs/1502.01852 - - 2012. [Online]. Available: http://www.image-net.org/challenges/LSVRC/ 2012/nonpub-downloads @@ -5175,26 +5088,6 @@ and neural network approximation,” IEEE Transactions on Information Theory, vol. 48, no. 1, pp. 264–275, Jan. 2002. [Online]. Available: http://ieeexplore.ieee.org/abstract/document/971754/ -109 - -http://www.image-net.org/challenges/LSVRC/2012/nonpub-downloads -http://www.image-net.org/challenges/LSVRC/2012/nonpub-downloads -https://arxiv.org/abs/1502.03167 -https://arxiv.org/abs/1512.07030 -http://karpathy.github.io/2011/04/27/manually-classifying-cifar10/ -http://karpathy.github.io/2011/04/27/manually-classifying-cifar10/ -https://arxiv.org/abs/1412.6980 -https://arxiv.org/abs/1412.6980 -https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf -https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf -https://arxiv.org/abs/1609.04836 -http://kocmi.tk/photos/DiplomaThesis.pdf -https://arxiv.org/abs/1511.06530 -https://www.cs.toronto.edu/~kriz/cifar.html -https://www.cs.toronto.edu/~kriz/cifar.html -http://ieeexplore.ieee.org/abstract/document/971754/ - - [KSH12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (NIPS), F. Pereira, C. J. C. Burges @@ -5240,25 +5133,6 @@ processing. IEEE, 2013, pp. 8595–8598. [Online]. Available: http: [LG16] A. Lavin and S. 
Gray, “Fast algorithms for convolutional neural networks,” in
-110
-
-http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
-http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
-http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition.pdf
-http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition.pdf
-https://arxiv.org/abs/1512.02325
-http://lasagne.readthedocs.io/en/latest/modules/layers/noise.html#lasagne.layers.DropoutLayer
-http://lasagne.readthedocs.io/en/latest/modules/layers/noise.html#lasagne.layers.DropoutLayer
-http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
-http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
-http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html
-http://dx.doi.org/10.1007/3-540-49430-8
-http://yann.lecun.com/exdb/publis/pdf/lecun-90b.pdf
-http://yann.lecun.com/exdb/publis/pdf/lecun-90b.pdf
-http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6639343
-http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6639343
-
-
Conference on Computer Vision and Pattern Recognition (CVPR). IEEE,
Sep. 2016, pp. 4013–4021. [Online]. Available: https://arxiv.org/abs/1509.09308
@@ -5306,22 +5180,6 @@ relu_hybrid_icml2013_final.pdf
[MM15] D. Mishkin and J. Matas, “All you need is a good init,” arXiv
-111
-
-https://arxiv.org/abs/1509.09308
-https://arxiv.org/abs/1509.08985v2
-https://arxiv.org/abs/1608.03983
-https://arxiv.org/abs/1608.03983
-https://arxiv.org/abs/1603.06560
-https://arxiv.org/abs/1606.01885
-https://arxiv.org/abs/1411.4038v2
-https://arxiv.org/abs/1703.01513
-https://github.com/titu1994/DenseNet
-http://lear.inrialpes.fr/people/marszalek/data/ig02/
-https://web.stanford.edu/~awni/papers/relu_hybrid_icml2013_final.pdf
-https://web.stanford.edu/~awni/papers/relu_hybrid_icml2013_final.pdf
-
-
preprint arXiv:1511.06422, Nov. 2015. [Online]. Available: https:
//arxiv.org/abs/1511.06422
@@ -5369,21 +5227,6 @@ weight-sharing,” Neural computation, vol. 4, no. 4, pp. 473–493, 1992.
[NH02] R. T. Ng and J. Han, “CLARANS: A method for clustering objects for spatial
-112
-
-https://arxiv.org/abs/1511.06422
-https://arxiv.org/abs/1511.06422
-http://ieeexplore.ieee.org/abstract/document/7301739/
-http://ieeexplore.ieee.org/abstract/document/7301739/
-http://ieeexplore.ieee.org/document/4270110/
-http://ieeexplore.ieee.org/document/4270110/
-https://arxiv.org/abs/1606.02228
-https://arxiv.org/abs/1512.02017
-http://papers.nips.cc/paper/5073-learning-with-noisy-labels.pdf
-http://www1.icsi.berkeley.edu/Speech/faq/nn-train.html
-https://www.cs.toronto.edu/~hinton/absps/sunspots.pdf
-
-
data mining,” IEEE transactions on knowledge and data engineering,
vol. 14, no. 5, pp. 1003–1016, 2002.
@@ -5430,20 +5273,6 @@ evolutionary computation, no. 12. ACM, 2010, pp. 563–570.
Explaining the predictions of any classifier,” arXiv preprint arXiv:1602.04938,
Feb. 2016. [Online]. Available: https://arxiv.org/abs/1602.04938
-113
-
-http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf
-http://ufldl.stanford.edu/housenumbers/
-https://arxiv.org/abs/1602.03616
-https://arxiv.org/abs/1608.08984
-https://arxiv.org/abs/1511.04508
-http://dx.doi.org/10.1007/3-540-49430-8_3
-http://dx.doi.org/10.1007/3-540-49430-8_3
-https://arxiv.org/abs/1409.0575
-https://arxiv.org/abs/1505.04597
-https://arxiv.org/abs/1602.04938
-
-
[Rud16] S.
Ruder, “An overview of gradient descent optimization algorithms,”
arXiv preprint arXiv:1609.04747, Sep. 2016. [Online]. Available: https:
//arxiv.org/abs/1609.04747
@@ -5490,22 +5319,6 @@ on Computer Vision and Pattern Recognition (CVPR). IEEE, Sep. 2015,
pp. 1–9. [Online]. Available: https://arxiv.org/abs/1409.4842
[SM02] K. O. Stanley and R. Miikkulainen, “Evolving neural networks through
-
-114
-
-https://arxiv.org/abs/1609.04747
-https://arxiv.org/abs/1609.04747
-https://arxiv.org/abs/1204.3968
-http://ieeexplore.ieee.org/document/6792316/
-https://arxiv.org/abs/1312.6229v4
-https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
-http://ieeexplore.ieee.org/document/6638963/?arnumber=6638963
-https://arxiv.org/abs/1602.07261
-https://arxiv.org/abs/1503.03832
-http://ieeexplore.ieee.org/document/6033589/
-https://arxiv.org/abs/1409.4842
-
-
augmenting topologies,” Evolutionary computation, vol. 10, no. 2, pp. 99–127,
2002. [Online]. Available: http://www.mitpressjournals.org/doi/abs/10.1162/
106365602320169811
@@ -5551,27 +5364,6 @@ https://arxiv.org/abs/1312.6199v4
[TF-16a] “MNIST for ML beginners,” Dec. 2016. [Online]. Available: https:
//www.tensorflow.org/tutorials/mnist/beginners/
-115
-
-http://www.mitpressjournals.org/doi/abs/10.1162/106365602320169811
-http://www.mitpressjournals.org/doi/abs/10.1162/106365602320169811
-https://arxiv.org/abs/1312.6120
-https://arxiv.org/abs/1312.6120
-https://arxiv.org/abs/1410.1165
-http://benchmark.ini.rub.de/?section=gtsrb&subsection=news
-http://benchmark.ini.rub.de/?section=gtsrb&subsection=news
-http://www.sciencedirect.com/science/article/pii/S0893608012000457
-http://www.sciencedirect.com/science/article/pii/S0893608012000457
-https://arxiv.org/abs/1606.02492
-https://arxiv.org/abs/1512.00567v3
-https://arxiv.org/abs/1312.6034
-https://arxiv.org/abs/1312.6034
-https://arxiv.org/abs/1409.1556
-https://arxiv.org/abs/1312.6199v4
-https://www.tensorflow.org/tutorials/mnist/beginners/
-https://www.tensorflow.org/tutorials/mnist/beginners/
-
-
[tf-16b] “tf.nn.dropout,” Dec. 2016. [Online]. Available: https://www.tensorflow.org/
api_docs/python/nn/activation_functions_#dropout
@@ -5618,24 +5410,6 @@ http://ieeexplore.ieee.org/document/21701/
tionist reinforcement learning,” Machine learning, vol. 8, no. 3-4,
pp. 229–256, 1992.
-116
-
-https://www.tensorflow.org/api_docs/python/nn/activation_functions_#dropout
-https://www.tensorflow.org/api_docs/python/nn/activation_functions_#dropout
-http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
-http://martin-thoma.com/write-math
-http://martin-thoma.com/write-math
-https://martin-thoma.com/twiddle/
-https://arxiv.org/abs/1602.06541
-https://arxiv.org/abs/1602.06541
-https://arxiv.org/abs/1701.08380
-https://martin-thoma.com/msthesis
-https://arxiv.org/abs/1312.5355
-http://dx.doi.org/10.1007/978-94-015-7744-1_2
-https://arxiv.org/abs/1702.00071
-http://ieeexplore.ieee.org/document/21701/
-
-
[WWQ13] X. Wang, L. Wang, and Y. Qiao, A Comparative Study of Encoding, Pooling
and Normalization Methods for Action Recognition. Berlin, Heidelberg:
Springer Berlin Heidelberg, Nov. 2013, no. 11, pp. 572–585. [Online].
@@ -5683,22 +5457,6 @@ M. Sugiyama et al., Eds. Curran Associates, Inc., Oct. 2016, pp.
1082–1090. [Online].
Available: http://papers.nips.cc/paper/6340-doubly-convolutional-
neural-networks.pdf
-117
-
-http://dx.doi.org/10.1007/978-3-642-37431-9_44
-https://arxiv.org/abs/1501.02876v4
-http://www.matthewzeiler.com/pubs/icml2013/icml2013.pdf
-http://www.matthewzeiler.com/pubs/icml2013/icml2013.pdf
-https://arxiv.org/abs/1611.05431v1
-https://arxiv.org/abs/1107.2490
-https://arxiv.org/abs/1505.00853
-https://www.sec.in.tum.de/assets/Uploads/ecai2.pdf
-http://yann.lecun.com/exdb/mnist/
-https://arxiv.org/abs/1611.03530
-http://papers.nips.cc/paper/6340-doubly-convolutional-neural-networks.pdf
-http://papers.nips.cc/paper/6340-doubly-convolutional-neural-networks.pdf
-
-
[ZDGD14] N. Zhang, J. Donahue et al., “Part-based R-CNNs for fine-grained category
detection,” in European Conference on Computer Vision (ECCV). Springer,
Jul. 2014, pp. 834–849. [Online]. Available: https://arxiv.org/abs/1407.3867
@@ -5742,24 +5500,6 @@ arXiv preprint arXiv:1506.02351, Jun. 2015. [Online]. Available: https:
units,” in International Joint Conference on Neural Networks (IJCNN),
Jul. 2015, pp. 1–4.
-118
-
-https://arxiv.org/abs/1407.3867
-https://arxiv.org/abs/1212.5701v1
-https://arxiv.org/abs/1212.5701v1
-https://arxiv.org/abs/1301.3557v1
-https://arxiv.org/abs/1311.2901
-http://places2.csail.mit.edu/download.html
-http://places2.csail.mit.edu/download.html
-https://arxiv.org/abs/1605.07146
-https://arxiv.org/abs/1605.07146
-https://arxiv.org/abs/1512.04150
-https://arxiv.org/abs/1610.02055
-https://arxiv.org/abs/1611.01578
-https://arxiv.org/abs/1506.02351v1
-https://arxiv.org/abs/1506.02351v1
-
-
I. Glossary
ANN artificial neural network. 4
@@ -5801,9 +5541,6 @@ NEAT NeuroEvolution of Augmenting Topologies. 83
OBD Optimal Brain Damage. 29
-119
-
-
PCA principal component analysis. 79
@@ -5814,5 +5551,3 @@ ReLU rectified linear unit. 5, 13, 60, 61, 63, 64, 72, 77, 78, 84
SGD stochastic gradient descent. 5, 30, 45, 46, 82
ZCA Zero Components Analysis. 79
-
-120
diff --git a/read/extraction-ground-truth/2201.00021.txt b/read/extraction-ground-truth/2201.00021.txt
index 2d55922..804b34e 100644
--- a/read/extraction-ground-truth/2201.00021.txt
+++ b/read/extraction-ground-truth/2201.00021.txt
@@ -54,8 +54,8 @@ regarded as a reliable thermometer of molecular clouds (e.g.,
Walmsley & Ungerechts 1983; Danby et al. 1988), ammonia
masers have attracted attention since the first detection of
maser action in the (J,K) = (3,3) metastable (J = K) line toward the
-massive star-forming region W33 (Wilson et al. 1982). Subse-
-quent observations have led to the detection of new metastable
+massive star-forming region W33 (Wilson et al. 1982). Subsequent
+observations have led to the detection of new metastable
ammonia masers, including 15NH3 (3,3) (Mauersberger et al.
1986), NH3 (1,1) (Gaume et al. 1996), NH3 (2,2) (Mills et al.
2018), NH3 (5,5) (Cesaroni et al. 1992), NH3 (6,6) (Beuther
@@ -117,8 +117,8 @@ J = 6 (e.g., Danby et al. 1988).
NH3 (9,6) masers are found to be strongly variable, similar
to H2O masers (Madden et al. 1986; Pratap et al. 1991; Henkel
et al. 2013). In W51-IRS2, Henkel et al. (2013) found that the (9,6)
-line showed significant variation in line shape within a time in-
-terval of only two days. Mapping of the (9,6) maser toward W51
+line showed significant variation in line shape within a time interval
+of only two days. Mapping of the (9,6) maser toward W51
with very long baseline interferometry (VLBI) suggests that the
masers are closer to the H2O masers than to the OH masers or to
ultracompact (UC) H ii regions (Pratap et al. 1991). While