Radiomics is described as the high-throughput extraction of large amounts of image features from radiographic images (1–4). Radiomic features provide quantitative descriptions of objects (tissues, suspected pathology, and anatomic regions) contained within the image data. These mathematical descriptors provide ways to characterize the size, shape, texture, intensity, margin, and other aspects of the imaging features of these objects, with the eventual goal of being able to accomplish a diagnostic imaging task, such as distinguishing benign from malignant nodules, assessing response to therapy, correlating imaging with genomics, or decoding the object's imaging phenotype and predicting survival outcomes (5). Within the NCI's Quantitative Imaging Network (6), there are concerted efforts to bring quantitative imaging and radiomic features into clinical trials to aid in treatment evaluation. Among specific efforts have been investigations into these clinical tasks as well as the sensitivity of extracted radiomic feature values to various aspects of the quantitative imaging chain such as image segmentation (7–8).
Despite the widespread use of radiomic features and even the public availability of software packages to compute radiomic features (9–12), there has been a lack of consistency in the definition of radiomic feature values and how they are calculated. This may be a factor contributing to the lack of reproducibility of results across different data sets and different institutions and specifically across different software packages that perform radiomic feature calculations. Recently, there has been an effort to standardizefeature definitions as described by the Image Biomarker Standardization Initiative (IBSI), which has published a reference manual (13). This reference manual has described a large number of features in detail and has even introduced conventions that provide unique codes to identify each of these features.
Therefore, the primary purpose of this study was to investigate the level of agreement in radiomic feature values that could be achieved across a set of institutions and software packages when using a set of common feature definitions, common image data sets (digital reference objects and patient data sets) and common object definitions (segmentations). Ideally, under these controlled conditions, all institutions and software packages would obtain the same values for all radiomic features for all objects. A secondary purpose of this study was to identify the underlying issues when that did not occur (when there were feature value disagreements) and to address those issues when possible. It should be pointed out that the goal of this study was to neither determine the utility of the identified features nor identify superiority of a single tool, in any specific clinical task. That is, this study does not evaluate which features demonstrate efficacy in predicting whether a lung lesion is benign or malignant or whether a patient is responding to therapy. This study also does not assess how sensitive various features are to different sources of variability in the quantitative imaging chain such as object segmentation or image acquisition and reconstruction parameters; these issues are being addressed in other efforts (7, 14–17). The primary goal is to ascertain how much detail is necessary to be reported on future studies to sufficiently describe a feature such that investigators at different institutions and/or using different software packages can produce consistent results. We consider this to be a very important step in the process of developing robust radiomic features that will ultimately contribute to the use of quantitative imaging methods in clinical trials and clinical practice.
This project was initiated by the positron emission tomography–computed tomography working group of the National Cancer Institute's Quantitative Imaging Network. Ten sites participated in the investigation, which took place in 4 phases. In the first phase, a small subset of radiomic features were identified that would make this project feasible to pursue with some depth. In the second phase, radiomic feature values were calculated by participating sites on a set of digital reference objects (DROs) to help identify potential issues. In the third phase, radiomic feature values were calculated by participating sites for identified objects in a small set of patient image data sets [identified lung nodules from the Lung Image Database Consortium (LIDC) data set (18)]. In the fourth phase, an effort was made to specifically harmonize the methodology for calculating one of the more complex radiomic features and to determine if the level of agreement could be improved. Each of these phases is described in more detail in the following sections.
Features Selected and Their Definitions
Investigators involved in the project agreed to identify 9 radiomic features to be investigated for this project. This subset of features was selected to simultaneously keep the number of features to a manageable level for this project while including features from several key categories including morphology (size and shape), intensity, and image texture. The IBSI reference manual (13) was identified as a reference, and feature definitions would be consistent with the definitions and conventions of that resource. Note that where the IBSI reference manual is cited, the section numbers and codes provided are from Version 10.
All features are defined for an identified object whose boundary is the volume of interest (VOI) accompanying each DRO or patient image data set. The 9 specific features selected for analysis were:
Approximate volume (IBSI Section 3.1.2; code YEKZ): This feature is a commonly used size descriptor that counts the number of voxels within an identified object and multiplies by the voxel size in cubic millimeter. Therefore, the volume is expressed in cubic millimeter.
2D diameter: Although this feature is commonly used in oncology assessments such as RECIST and Lung-RADS (19), it is not described in the IBSI reference manual. For this investigation, we agreed to calculate the diameter in a single axial slice. However, sites were allowed flexibility in both how the slice was selected and in how the longest diameter was calculated. For example, some sites calculated the longest chord from all boundary points on each slice and chose the largest of these as the 2D diameter. Others calculated the longest chord on the slice with the largest area and chose this as the 2D diameter. In all cases, the 2D diameter was specified in millimeter.
3D diameter (IBSI Section 3.1.11, code L0JK): This feature is the distance between the 2 most distant vertices from the set of boundary voxels in the VOI. These vertices are not constrained to lie in the same image plane. In this project, the diameter was specified in millimeter.
Mean intensity (IBSI Section 3.3.1, code Q4LE): This feature is the mean intensity value over the VOI. In this project, because the patient image data sets were computed tomography (CT) scans, and the DROs were scaled apropos to CT scans, the mean intensity was specified in Hounsfield Units (HU).
Standard deviation: This feature is not explicitly defined by the IBSI reference manual; however, it does define the “intensity variance” (Section 3.3.2, code ECT3). In this project, the standard deviation in HU was defined as the square root of the intensity variance.
Kurtosis (IBSI Section 3.3.4, code IPH6). This feature is technically referred to as the excess kurtosis and is based on the fourth moment of the intensity distribution and is used as a measure of peakedness in that intensity distribution. It should be noted that in this definition, kurtosis is corrected by a Fisher correction of −3 to center it on 0 for normal distributions. This feature is dimensionless.
Surface area (IBSI Section 3.1.3, code C0JK): This feature is calculated based on the mesh approach described in IBSI Section 3 and specifically from a VOI mesh defining the surface of the object by summing over the triangular face surface areas (mm2) of the mesh.
Sphericity (IBSI Section 3.1.8, code QCFX): This dimensionless feature describes how sphere-like the volume is. It is based on the ratio between volume (calculated from the mesh used for surface area, not the approximate volume feature used in this study) and surface area.
Gray Level Co-occurrence Matrix entropy (IBSI Section 3.6.4, code TU9B): This feature is described as “Joint Entropy” by the IBSI reference manual and is a texture measure described by Haralick et al. (20) using the Gray Level Co-occurrence Matrix (GLCM) approach. In the IBSI reference manual, there are descriptions of how the matrices are formed and how the features are aggregated (6 different methods for aggregation are described). In the first phases of this project, sites used whatever their local software package allowed in terms of aggregation and other parameters. In the final phase of this project, there was an effort to harmonize the approaches and parameters for this feature. This will be described in more detail in the following sections. This feature is dimensionless.
Image Data Sets
This study used 2 different image data sets. The first data set was a set of DROs created by the participants from Stanford (21). The second data set was a set of 10 patient image studies from the publicly available Lung Image Database Consortium (18) collection hosted on the The Cancer Imaging Archive (TCIA) website (22). For each data set (DRO and patient data sets), we identified a specific object for use in this project. One VOI was created for each object and all sites used that same VOI definition. Based on our experience with a previous project (7), both DICOM and NIfTI formats were created for each image data set and the VOI was provided in each format as well (DICOM Segmentation Object (DSO) as well as NIfTI segmentation boundary). All image data (DRO and patient image data) as well as VOIs in both DSO and NIfTI formats are publicly available at https://doi.org/10.7937/tcia.2020.9era-gg29.
The DROs were created by the participants at Stanford and served as a starting point for our analysis (21). These DROs are generated from continuous 3D functions with known “features” (eg, volumes, mean intensities) that are sampled and stored as DICOM (or NIfTI) images. These features have settable parameters and known definitions as specified in (21). For this project, the DROs were created with a 512 × 512 matrix similar to typical CT images and used a 500-mm display FOV. The slice thickness and spacing were each set to 1.0 mm. This results in voxels that are ∼1.0 × 1.0 × 1.0 mm. The intensity of the DROs were defined using mathematical functions that set the internal gray values. These values can be converted to specific intensity values (eg, HU for CT) when desired. The DROs used in this study used the DICOM fields of rescale intercept [(0028, 1052) with a value of −1024] and rescale slope [(0028, 1053) with a value of 1] similar to CT images. Saved as conventional medical images, these objects can have the same radiomics operations applied to them as to patient images, allowing the accuracy of feature calculations to be determined in a controlled fashion. This is of substantial value when trying to achieve the same feature values across software packages, and these objects play a vital role in helping sites diagnose underlying issues.
Figure 1 shows the 3 DROs used in this study; each was designed to exercise different radiomic features. The DROs created for this project were: (1) a mathematically defined sphere with a radius of 100 mm and with uniform intensities (100 HU) for all voxels within the VOI; (2) a mathematically defined sphere with a radius of 100 mm, but having intensity variation that is sinusoidal in 3 dimensions with a mean of 100 HU, an amplitude of 50 HU, and a wavelength of 10 mm (3) a uniform intensity (100 HU) object with a nonspherical shape that is described by a sinusoidal variation in the location of the surface with a mean radius of 100 mm, an amplitude of 20 mm, and unitless azimuthal and inclination angular frequencies of 9 (ie 9 “bumps” in a cross section). More detailed descriptions are available in the study by Jaggi et al. (21), and the DROs are publicly available in DICOM and NIfTI formats at: https://doi.org/10.7937/tcia.2020.9era-gg29.
Patient Data Sets.
The patient data sets for this study were a subset of patient data sets used in a previous study (7, 14). Specifically, the same 10 cases selected from the LIDC data set (18) that were used in that previous study were used in this study (see details of cases in the online supplemental Appendix). As in that previous study, a single lesion from each case was identified for analysis. That previous study generated VOIs using algorithms from 3 academic institutions and each method performed three repeat runs on each nodule. For this study, and to eliminate one source of variability, 1 VOI was created for each lesion and all sites used that same VOI definition. The specific VOI chosen for each lesion was the first run of the first algorithm [Algorithm 1 in Kalpathy-Cramer et al. (14)]. Based on our experience with that previous project (7), both DICOM and NIfTI formats were created for each image data set, and the VOI was provided in each format as well (DSO and NIfTI segmentation boundary).
Three example lung lesions are shown in Figure 2. The online supplemental Appendix contains a table with details about the lesions including a description of which LIDC data set was used and the range of nodule sizes and densities (1 nodule was clearly calcified). That table also shows that 5 of the 10 cases used contiguous reconstructions (slice spacing = slice thickness), while the other 5 used overlapped reconstructions (slice spacing < slice thickness). Finally, this table also shows that images acquired from different scanners (GE vs Philips) used different values for the conversion from gray levels to HU (DICOM tag rescale intercept; image data sets from GE scanners used −1024 as the rescale intercept, while image data from Philips scanners used rescale intercept of −1000).
Software Packages Used
Feature calculation results were received from 10 different sites. Some sites submitted results from 2 different software packages. The software packages were primarily developed by the participating institutions with the exception of PyRadiomics (10, 23) which is an open source Python-based radiomics package that 3 different sites used (1 used only PyRadiomics and 2 other sites submitted results from their own software as well as data from PyRadiomics). Table 1 summarizes the software packages used in this study as well as whether they used DICOM, DSO, or NIfTI formats for the images and VOIs.
|Site||Software||Image Data Used||VOI Data Used|
|1. Stanford||Quantitative Image Feature Engine (QIFE) (9, 24)||DICOM||DSO|
|1. Stanford||PyRadiomics (10, 23)||DICOM||DSO|
|2. UCLA||Quantitative Image Analysis (QIA) (25–27)||DICOM||NIfTI|
|2. UCLA||PyRadiomics (10, 23)||DICOM||NIfTI|
|3. UW||PMOD (28)||DICOM||NIfTI|
|3. UW||PORTS (GLCM only) (29)||DICOM||NIfTI|
|4. USF||Package 1||DICOM||NIfTI|
|5. Moffit||Package 2||DICOM||NIfTI|
|6. Columbia||Package 3 (17)||DICOM||NIfTI|
|7. Michigan||MiViewer (30, 31)||DICOM||NIfTI|
|8. BC Cancer||SERA (32)||DICOM||NIfTI|
|9. Penn (CBICA)||CaPTK (11)||DICOM||NIfTI|
|10. UCSF||PyRadiomics (10, 23)||DICOM||DSO|
Harmonization of GLCM Entropy
Each site and software package (including the authors of PyRadiomics) was surveyed to determine if they were calculating radiomic features according to the definitions described by the IBSI (above). In some cases, the software packages were developed by others and it could not be precisely determined how the features were being calculated. In the specific case of the GLCM entropy feature, there are several choices in how this feature was calculated and this was determined as best as possible. This included trying to determine parameters related to the formation of the co-occurrence matrix (distance, angle), quantization of gray levels (HU), the number of levels to form the matrix, and whether the implementation required both the source and destination voxel, or just the source voxel, to be contained within the VOI. Other parameters surveyed were related to the calculation of the feature including how the feature aggregation was being performed [in Section 3.6 of the IBSI reference manual (13)].
The survey of the participating sites and their software packages revealed that there was a wide range of approaches to calculating GLCM entropy. The initial decision was made to proceed with whatever default parameters were being used in each site's software package(s) for the purposes of assessing agreement using the DROs and patient data sets. The decision was also subsequently made to determine if sites could harmonize their calculation of the GLCM entropy feature values by using as many common parameter settings (without rewriting any code) as possible.
Specifically, sites harmonized their calculation of GLCM entropy by using the following in their calculations:
Interpolate to isotropic 1-mm voxels.
Calculate the features in 3D with 13 angles and d = 1 mm.
Select 256 gray levels for quantization and use the voxels just within the VOI for this quantization step (rather than the entire image volume).
For the aggregation method, it seems that most software packages were fixed, but for PyRadiomics, there were a few options that could be selected. One was that features were computed from each 3D directional matrix and then averaged over the 3D directions to arrive at a single feature value. Another option was that the feature value is computed from a single matrix after merging all of the 3D directional matrices. These correspond to options (e) “volume without merging” and (f) “volume with full merging,” as described in Figure 3.3 of the IBSI reference manual (13).
It should be noted here that these parameter settings were not determined to be optimal in any way. Instead they were identified as the parameter settings most likely to be common among the software implementations used by our investigators and therefore most likely to lead to values that would be agreed upon by different sites and software packages.
Results of feature calculations were reported for 3 distinct phases of this project: (a) DROs, (b) patient data sets, and (c) harmonized GLCM entropy results for all patient data sets.
For the DROs, each site reported the value of each of the 9 indicated features for each DRO phantom using the indicated software package. The result was that for each feature calculated on each DRO, the mean value, standard deviation, and coefficient of variation (CV) (expressed as a percentage) were calculated across software packages. For the patient data sets, a similar approach was used so that for each feature calculated on each patient data set, the mean value, standard deviation, and CV (expressed as a percentage) were calculated across software packages. For the harmonized GLCM entropy phase of the project, only the GLCM entropy value using the harmonized set of parameters was recorded for each patient image data set for each software package; the mean value, standard deviation and CV (expressed as a percentage) were calculated across software packages.
The results from the 3 investigations (feature calculations performed on DROs, feature calculations performed on patient studies, and GLCM feature calculations after using the harmonized parameter settings described in the previous section) are described in the following sections.
Results were received from 13 different submissions from 10 sites for this part of the study; as indicated in Table 1 above, 2 sites submitted results from their own locally developed software as well as results from PyRadiomics, and 1 group submitted results from their own locally developed software as well as a software package called PMOD. Table 2 shows the results of the CV expressed as percentage (CV%) by phantom and by feature.
|DRO||Approximate Volume||Surface Area||2D Diameter||3D Diameter||Sphericity||Mean Intensity||Standard Deviation||Kurtosisb||GLCM Entropy|
|Intensity Varying Phantom||0.004%||13.41%||0.23%||0.27%||12.82%||0.00%||0.11%||0.31%||50.9%|
|Shape Varying Phantom||0.010%||12.27%||0.71%||0.18%||11.70%||0.00%||—||—||625%|
i] aCV results are expressed as a percentage for each feature and each DRO across all 13 submissions. Note that there was no value for standard deviation and kurtosis for the uniform phantom and shape varying phantom, as the intensity values for these phantoms were all set to the same value (100 HU).
Six of the 9 features calculated show excellent agreement across submissions and phantoms with CV < 1%; specifically, the approximate volume, 2D diameter, 3D diameter, mean intensity, standard deviation, and kurtosis (after Fisher adjustment) features. Larger variations (12%–13%) were observed for the surface area and sphericity features. Finally, extremely large variation values were observed (51%–1001%) for the GLCM entropy feature across submissions and phantoms. One note about the GLCM entropy variation is that in the uniform phantom and the shape varying phantom, the intensity values within the VOI were completely uniform and therefore most software packages returned a GLCM entropy value of 0 (or very close to 0). However, 2 submissions (out of 13) had nonzero GLCM entropy values for these 2 phantoms, resulting in a very small mean value across submissions (0.1 for the uniform phantom and 0.2 for the shape varying phantom) which led to a very large CV.
Patient Data Set Results
For the patient data sets, 13 different submissions originating from 10 sites were also received. Table 3 shows the mean and standard deviation of the CV as percentage (CV%) values for each feature. This was calculated by first calculating the CV for each patient data set across software packages and then calculating the mean and standard deviation of those CVs across all 10 patient cases.
|Coefficient of Variation||Approximate Volume||Surface Area||2D Diameter||3-D Diameter||Sphericity||Mean Intensity||Standard Deviation||Kurtosisb||GLCM Entropy|
|Standard Deviation CV||0.00%||2.49%||5.19%||2.25%||3.57%||0.00%||0.09%||0.74%||4.66%|
These results show that only 4 of the 9 features calculated show the same level of excellent agreement (CV < 1%) as observed with the DROs; specifically, the approximate volume, mean intensity, standard deviation, and kurtosis (after Fisher adjustment) features. Slightly larger variations (mean CV of 3%–9%) were observed for 2D diameter and 3D diameter features. Larger variations (mean CV = 17%) were again observed for the surface area and sphericity features. Finally, larger variations were observed (mean CV = 37%) for the GLCM entropy feature across submissions and patient cases. Note that because these were patient data sets and the intensity values were not completely uniform, the mean GLCM entropy values for each patient data set actually ranged from 5.4 to 7.8 (with a much wider range of individual values from different software packages); this in turn led to smaller CV values than those observed for uniform DROs.
Results from GLCM Entropy Harmonization
For this final investigation, 11 submissions from 9 sites were received in which the harmonized parameter settings for the GLCM entropy calculation (described above) were used. Table 4 shows the individual CV values for each case (calculated across submissions) under both the harmonized parameter settings and each site's default (nonharmonized) settings used to obtain the results in the previous section. Table 4 also shows the mean and standard deviation of the CV as percentage (CV%) values for each feature across cases (and submissions).
|Case||CV (%)Harmonized Settings||CV (%) Default Settings (Table 3 results)|
i] The values are expressed as a percentage for each case (across submissions) as well as the mean and standard deviation across cases. The first column reports the results when using the harmonized parameters described above. The second column reports the results when using the default (non-harmonized) parameters.
These results show a substantially reduced CV when the parameters are harmonized across software packages compared with CV when default settings are used (mean CV of 19.3% vs 36.2%). This is true for each individual case as well. However, it should also be noted that the agreement is still relatively modest (mean CV% ∼20% across cases and software packages), which indicates there are still outstanding issues to resolve to obtain the levels of agreement seen in the features with the excellent levels of agreement (eg, approximate volume, mean intensity, etc.).
The purpose of this work was to investigate the level of agreement among radiomic features that could be achieved when computed by several groups using different software packages. A secondary purpose of this work was to use these investigations to identify issues that led to differences in feature values produced by software packages and to determine if these issues could be readily addressed.
The use of DROs was extremely helpful for both of these purposes and especially the latter. Because the size, shape, and intensity values of each DRO was known, each object provided a unique opportunity to identify when software packages were calculating different values for the features of interest and allowed investigators opportunities to understand the underlying causes behind these differences. Some of these will be discussed in the following paragraphs.
The use of specific lesions in patient data sets was also very helpful in pursuing this project's primary purpose, in that the lesions studied did represent clinical objects of interest. In addition, the 10 cases selected represented different challenges in terms of lesion size and composition, slice thickness, and slice spacing as well as manufacturer (which influences expected aspects, such as contrast-to-noise ratios, as well as unexpected ones, such as the DICOM field rescale intercept).
When our investigators performed this task under very tightly controlled conditions of the same feature definitions, using same image data and the same VOI definition, some features showed excellent agreement (CV < 1%). This was true for the approximate volume, mean intensity, standard deviation, and kurtosis features. The definitions of these features are relatively straightforward and there are not many choices in parameters when calculating these features. The one exception was in the kurtosis feature where some software packages do not include the Fisher adjustment (subtracting 3 from the sum). With the use of the DROs, this was readily identified and corrected.
For other features, the agreement between software packages was not quite as good. The 3D diameter feature did show excellent agreement between software packages for the DROs (CV < 1%), but showed slightly larger variation when more irregularly shaped objects were used in the patient data sets; however, the resulting variability was still reasonably low (CV < 3%) across all 10 patient data sets and software packages. The 2D diameter feature (which was not actually defined by the IBSI) showed larger variation than the 3D diameter. This increased variation may be the result of differences in approach that were allowed for this study. As described above, some sites calculated diameter on the slice with the largest area, while others calculated the diameter for each slice and used the maximum from all slices. Thus, while the DRO results showed little variation (CV < 1% for the spherical phantoms and CV < 5% for the shape varying phantom), there were larger variations (CV = 10%) for the patient data sets, which are likely because of the variations in approach described here. Our investigation did not constrain which approach should be used (nor did we specify the image on which the diameter calculation should be made) so as to reflect conditions typically encountered in the clinic (such as making RECIST measurements for a clinical trial).
The surface area and sphericity features showed larger variability (CV = 12%–18%) in both DRO and patient data set results. First, it should be noted that these 2 features are related, in that sphericity uses the surface area (and volume) in its definition. It should also be noted that the IBSI description of the surface area feature is specific to the use of the mesh-based calculation method. This does require some specification (either implicitly or explicitly) of the size or number of triangles used in the mesh, which was not constrained in this investigation. In addition, some of the software packages used may not have used the mesh-based approach, but may have calculated an approximation to surface area by counting the areas of surface voxels. This disparity in approaches is likely one source of variability in this study.
Finally, the GLCM entropy feature showed the largest amount of variability across software packages. For the DROs, the CV was 51% for the intensity varying phantom and much higher values for the uniform and shape varying phantoms. These latter high values were explored and found to be the product of 2 software packages that calculated nonzero values for a completely uniform object. This was investigated and determined that when forming the co-occurrence matrices, some software packages check to determine if both the source and destination voxels are contained within the VOI while others only check to determine if the source voxel is contained within the VOI. In the former situation, both voxels are within the VOI; for the DRO with uniform voxel values, there will be no intensity variation, and the GLCM entropy value will be 0. In the latter situation, near the boundary of the VOI, the destination voxel may be outside the VOI and then voxels outside the VOI are included in the calculation. For the DRO with uniform values (100 HU) inside the boundary and uniform but different values (−1000 HU) outside the boundary, this will lead to nonzero values for the GLCM entropy values.
Furthermore, for the patient image data sets, the GLCM entropy values showed larger variations (CV = 36% across 10 cases) than other features. We hypothesized that this was due to the large number of choices and parameter settings that go into both the formation of the GLCM matrix and the calculation of the feature itself. These include (but are not limited to) issues related to: (1) any interpolation scheme used either for the image data at the beginning of the matrix formation or later during the formation process; (2) the choice of distance and angle or the use of multiple distances and angles (and any subsequent averaging over distances and/or angles); (3) the discretization scheme (scale and number of quantization bins) used; and (4) the feature aggregation scheme (IBSI describes 6 different schemes).
To determine if the effects of these choices and parameter settings could be mitigated, investigators agreed to a reference set of conditions described in the section on Harmonization of GLCM entropy and recalculated their results. Our results showed that the variability was reduced (CV reduced from 36% to 19%), but still did not approach that of the features with excellent agreement. This indicates that even this level of harmonization was not sufficient to achieve excellent agreement across software packages with this complex feature.
Also, even after harmonization, the results in Table 4 indicated some possible dependence of agreement (CV%) on lesion size. Specifically, lesions 3 and 8 were the smallest lesions, but showed larger than average CV values. In addition, lesion 4 was a calcified lesion and it also showed large CV values across packages. Thus, lesion size and density may have an impact on this specific feature value, but we did not have a large enough sample size of lesions to explore this fully in the current study.
In summary, this study has shown that excellent agreement can be achieved for some features when standardized definitions of features are provided such as those in the IBSI reference manual. This led to excellent agreement among software packages for features such as approximate volume, mean intensity, standard deviation, kurtosis (with Fisher adjustment), and even 3D diameter.
For more complex features such as surface area, sphericity, 2D diameter, and even GLCM entropy, very complete definitions of specific approaches and parameter settings used (eg mesh size, interpolation schemes, quantization levels, etc.) are needed to achieve the levels of agreement observed in the other features.
There are several limitations of this study. This study does not address all sources of variation in the calculation of radiomic features. For example, this study did not address issues related to image segmentation and the variability introduced when different VOIs are identified for a given object (7) or the effects of different image acquisition and reconstruction conditions on feature variability (16, 17). This study also does not address the very important question of which features actually provide information relating to a specific clinical task; for example, this study did not address which features are useful in discriminating benign from malignant lesions or responders from nonresponders in clinical trials. As noted previously, some features may be shown to be very stable over a wide range of segmentation or acquisition and reconstruction conditions (eg feature value may be constant), but may not necessarily be useful in performing a given clinical task such as differentiating between benign and malignant lesions. This investigation was also limited to individual radiomic feature values and did not extend to investigations into “feature signatures” or functions that use multiple radiomic features, which would of course depend on the calculation of these underlying features. While evaluating the robustness of radiomic signatures was beyond the scope of the current investigation, this would be a very important next step in evaluating the use of combinations of radiomic features in clinical tasks.
In addition, this study investigated a limited number of features that represent a small fraction of the radiomic features described in the literature and specifically in the IBSI reference manual (13). This was done intentionally to keep the study size manageable and to allow thorough investigation of the features that were included in the study. Therefore, the definition and implementation issues for all radiomic features were not all addressed in this study; however, we believe that this exercise is instructive in identifying some underlying issues that need to be addressed with regard to the reproducibility of radiomic features.
It should be noted that 3 different sites all submitted results using the open source software package PyRadiomics (10, 23). This may have led to somewhat artificially low CV (CV%) results in some instances. This concern is somewhat offset by the observation that several features were observed to have very low (CV < 1%) across all software packages. On the other hand, even these 3 sites did not achieve complete agreement for the GLCM entropy feature, despite the harmonization steps taken, although the agreement was better than that across all packages (mean CV was 10% for just PyRadiomics, compared to 19% observed across all packages).
The sites that used PyRadiomics noted that the package is built nightly, which may result in slightly different versions of the code and a potential source of variation. Moreover, because PyRadiomics can use both DSO and NIfTI segmentation objects (and 2 of our sites used DSOs while 1 used NIfTI), this may contribute to slight variations in feature values.
In addition, there were several lessons learned by the participating sites in this study. Many of these relate to assumptions that sites (and specific software packages) have made that are often implicit or are the result of implementation of certain mathematical operations or parameters for calculation of a feature by a site or in a specific package. These variations are difficult to identify without going through an exercise such as using a reference set of images (both DROs and patient data sets) and comparing results. These lessons include the following:
When calculating volume, slice spacing should be used rather than slice thickness [DICOM field (0018, 0050)]. Slice spacing is not a specific DICOM field, but can be calculated as the difference in either the z location of adjacent images using (0020, 0032) or (0020, 1041). This can be used to calculate features such as approximate volume. Using slice thickness can lead to overestimation of volume values in some cases. In many CT studies, the slice spacing and slice thickness values may be the same (eg, 2.5-mm-thick images spaced 2.5 mm apart), but the online supplemental Appendix shows that 6 of our 10 patient cases had slice spacing values different from the slice thickness. In these cases, using slice thickness instead of spacing can lead to erroneous volume values.
Identification of when interpolation of image data is taking place. In some cases, this interpolation takes place explicitly when image data are read in, and in other cases, it may occur implicitly or as part of the calculation of a feature. The interpolation as well as the interpolation scheme (eg, nearest neighbor, trilinear, tricubic convolution, tricubic spline interpolation, etc.) can lead to variations in feature values.
The methods for interpretation of VOI boundaries and the management of “holes” within a VOI (eg whether the software would “fill the hole” or leave it “as is”) could be potential sources of variance between software packages.
There were also several feature-specific issues. For example, when forming GLCM matrix, there were differences between software packages in whether both the source and destination voxels are checked to be within VOI or if just the source voxel was checked. Another example was that the surface area feature is defined by a mesh-based representation in the IBSI reference manual (Section 3.1.3 of that manual), but our sites reporting feature values may not use that same approach. In addition, even with the mesh-based representation, there still is variability introduced by the selection of parameters such as the mesh size.
Finally, the GLCM entropy feature was specifically chosen for analysis in this study to be representative of the complexity of commonly used texture features in radiomics studies. This feature has several parameters and approaches involved in its calculation (as identified above), and the choice of each can contribute to variability. Our results showed that some issues could be mitigated by using similar parameters, but others may require re-coding of the software. In general, such complex features can be replicated universally only if a common software package (eg, PyRadiomics or similar) is used. However, if such an approach is considered, the community should thoroughly evaluate the specific definitions and implementations of the features in the potential common package and reach consensus that it is acceptable as the “gold standard” before recommending its use.
Based on the initial purpose of this study, its results and the lessons learned from this study, this group recommends the following to improve the reproducibility of results:
That authors provide detailed description of the image analysis, image pre-processing, and feature definitions in future studies reporting results involving radiomic features so that an independent site could reproduce the results of the study.
That IBSI feature definitions be used in manuscripts wherever possible, including the feature coding descriptors whenever possible.
That DICOM develop a DICOM standard for reporting features according to the definitions and features described in the IBSI reference manual.
That reporting structures support flexibility that will allow additional information regarding feature descriptions (in case the IBSI definition is not quite complete) and will allow for further development of future features to be included.
That publication of results should include both the software name and version number. Although this might be difficult for open source software packages that undergo frequent builds and releases, we encourage the description of the used version in as much detail as possible.
That radiomics software packages follow unique release version numbering.
That software source code be released whenever possible (as has been done by several packages used in this study—see Table 1). Mechanisms such as GitHub have proven to be extremely useful in this regard.
From this study, we are able to conclude that some radiomic features (approximate volume, mean intensity, standard deviation, kurtosis with Fisher adjustment, and 3D diameter) can be calculated across different software packages and achieve high levels of agreement. However, we were not able to show the same level of agreement for other features such as surface area, sphericity, 2D diameter, and GLCM entropy. For these features we observed larger variability either due to variations in user selections (2D diameter), implementation approach, or parameter selections. This indicates that even when a reference set of radiomic feature definitions is used, additional information may be required to reproduce results obtained with different software packages. Thus, efforts to further define options in how radiomic features are calculated as well as encouragement in reporting these details can help improve one aspect of radiomic feature variability and ultimately contribute to the reproducibility of these features as they are introduced to clinical trials and ultimately to clinical practice.